Preface



This volume contains the papers presented at the Second International Confer- 
ence on Discovery Science (DS’99), held in Tokyo, Japan, December 6-8, 1999. 
The conference was colocated with the Tenth International Conference on Algo- 
rithmic Learning Theory (ALT’99). 

This conference was organized as part of the activities of the Discovery
Science Project sponsored by Grant-in-Aid for Scientific Research on Priority
Area from the Ministry of Education, Science, Sports and Culture (MESSC) of
Japan. This is a three-year project starting from 1998 that aims to (1) develop
new methods for knowledge discovery, (2) install network environments for
knowledge discovery, and (3) establish Discovery Science as a new area of
computer science.

The aim of this conference is to provide an open forum for intensive discus- 
sions and interchange of new information among researchers working in the new 
area of Discovery Science. 

Topics of interest within the scope of this conference include, but are not
limited to, the following areas: Logic for/of knowledge discovery, knowledge
discovery by inferences, knowledge discovery by learning algorithms, knowledge
discovery by heuristic search, scientific discovery, knowledge discovery in
databases, data mining, knowledge discovery in network environments, inductive
logic programming, abductive reasoning, machine learning, constructive
programming as discovery, intelligent network agents, knowledge discovery from
unstructured and multimedia data, statistical methods for knowledge discovery,
data and knowledge visualization, knowledge discovery and human interaction,
and human factors in knowledge discovery.

The DS’99 program committee selected 26 papers and 25 posters/demos 
from 74 submissions. Papers were selected according to their relevance to the 
conference, accuracy, significance, originality, and presentation quality. 

In addition to the contributed papers and posters, two speakers accepted our 
invitation to present talks: Stuart Russell (University of California, Berkeley), 
and Jan M. Zytkow (University of North Carolina). Three more speakers were 
invited by ALT’99: Katharina Morik (University of Dortmund), Robert Schapire 
(AT&T Shannon Lab.), and Kenji Yamanishi (NEC). Invited talks were shared 
by the two conferences. 

Continuation of the DS series is supervised by its steering committee
consisting of Setsuo Arikawa (Chair, Kyushu Univ.), Yasumasa Kanada (Univ.
of Tokyo), Akira Maruoka (Tohoku Univ.), Satoru Miyano (Univ. of Tokyo),
Masahiko Sato (Kyoto Univ.), and Taisuke Sato (Tokyo Inst. of Tech.).

DS’99 was chaired by Setsuo Arikawa (Kyushu Univ.), and assisted by the 
local arrangement committee: Satoru Miyano (Chair, Univ. of Tokyo), Shigeki 
Goto (Waseda Univ.), Shinichi Morishita (Univ. of Tokyo), and Ayumi Shino- 
hara (Kyushu Univ.). 







We would like to express our immense gratitude to all the members of the 
program committee, which consisted of: 

Koichi Furukawa (Chair, Keio U., Japan) 

Peter Flach (U. Bristol, UK) 

Randy Goebel (U. Alberta, Canada) 

Ross King (U. Wales, UK) 

Yves Kodratoff (Paris-Sud, France) 

Pat Langley (Inst. for the Study of Learning & Expertise, USA)

Nada Lavrac (Jozef Stefan Inst., Slovenia) 

Heikki Mannila (Microsoft Research, USA) 

Katharina Morik (U. Dortmund, Germany) 

Shinichi Morishita (U. Tokyo, Japan) 

Hiroshi Motoda (Osaka U., Japan) 

Stephen Muggleton (York University, UK) 

Koichi Niijima (Kyushu U., Japan) 

Toyoaki Nishida (NAIST, Japan)

Keiichi Noe (Tohoku University, Japan) 

Hiroakira Ono (JAIST, Japan)

Claude Sammut (U. NSW, Australia) 

Carl Smith (U. Maryland, USA) 

Yuzuru Tanaka (Hokkaido U., Japan) 

Esko Ukkonen (U. Helsinki, Finland) 

Raul Valdes-Perez (CMU, USA) 

Thomas Zeugmann (Kyushu U., Japan) 
The Melting Pot of Automated Discovery:
Principles for a New Science 



Jan M. Zytkow 



Computer Science Department, UNC Charlotte, Charlotte, N.C. 28223 
and Institute of Computer Science, Polish Academy of Sciences 
zytkow@uncc.edu



Abstract. After two decades of research on automated discovery, many 
principles are shaping up as a foundation of discovery science. In this 
paper we view discovery science as the automation of discovery by systems
that autonomously discover knowledge, together with a theory of such systems.
We start by clarifying the notion of discovery by automated agent. Then 
we present a number of principles and discuss the ways in which different 
principles can be used together. Further augmented, a set of principles 
shall become a theory of discovery which can explain discovery systems 
and guide their construction. We make links between the principles of 
automated discovery and disciplines which have close relations with dis- 
covery science, such as natural sciences, logic, philosophy of science and 
theory of knowledge, artificial intelligence, statistics, and machine learn- 
ing. 



1 What Is a Discovery 

A person who is first to propose and justify a new piece of knowledge K is
considered the discoverer of K. Being the first means acting autonomously,
without reliance on external authority, because there was none at the time when
the discovery was made, or the discovery contradicted the accepted beliefs.

Machine discoverers are a new class of agents who should eventually be held
to the same standards. Novelty is important, but a weaker criterion of novelty
is useful in system construction:

Agent A discovered knowledge K iff A acquired K without the use of
any knowledge source that knows K.

This definition calls for cognitive autonomy of agent A. It requires only that
K is novel to the agent; the discovery does not have to be made for the first
time in human history. The emphasis on autonomy is proper in machine discovery. Even
though agent A discovered a piece of knowledge K which has been known to 
others, we can still consider that A discovered K, if A did not know K before 
making the discovery and was not guided towards K by any external authority. 
It is relatively easy to trace the external guidance received by a machine discov- 
erer. All details of software are available for inspection, so that both the initial 
knowledge and the discovery method can be analyzed. 
The existing systems would not reach success in making discoveries if we
humans did not provide help. But even a limited autonomy is sufficient, or else we
would hold machine discoverers to higher standards than humans. Consider his- 
torical details of any human discovery in order to realize the amount of method, 
knowledge and data which has been provided by others. Even the most revolu- 
tionary discovery made by an individual is a small incremental step prepared by 
prior generations. 



2 Who Are Automated Discoverers 

One of the main research directions in machine discovery has been the automa- 
tion of discovery in science. Many of the recent results can be found in collections 
edited by Shrager & Langley (1990), Edwards (1993), Zytkow (1992, 1993), Si- 
mon, Valdes-Perez & Sleeman (1997), and in Proceedings of 1995 AAAI Spring 
Symposium on Systematic Methods of Scientific Discovery, 1998 ECAI Work- 
shop on Scientific Discovery, and AISB’99 Symposium on Scientific Creativity. 
Risking a slight oversimplification, research on scientific discovery can be split 
into discovery of empirical laws and discovery of hidden structure. We will use 
many systems as examples, but the size limits preclude a systematic outline of 
discovery systems and their capabilities. 



2.1 Knowledge Discovery in Databases 

Automation of scientific discovery can be contrasted with knowledge discovery in 
databases (KDD) which is a fast-growing field, driven by practical business appli- 
cations. KDD is focused on data which are fixed and collected for purposes alien 
to discovery. Both the search techniques and the results of knowledge discovery 
in databases are far more simplistic than those of automated scientific discov- 
ery. Numerous publications describe KDD research, for instance collections of 
papers, typically conference proceedings, edited by Piatetsky-Shapiro & Frawley 
(1991), Piatetsky-Shapiro (1993), Ziarko (1994), Komorowski & Zytkow (1997), 
Zytkow & Quafafou (1998), Chaudhuri & Madigan (1999) and many others. 



2.2 A Rich Brew Is Melting in the Pot 

Knowledge discovery tools have been inspired by many existing research areas:
artificial intelligence, philosophy of science, history of science, statistics,
logic, and the natural sciences, which provide the clearest standards of
discovery. Each of these areas has been occupied with a different aspect of discovery.
Machine discovery uses the creative combination of knowledge and techniques 
from the contributing areas, but also adds its own extra value, which we try to 
summarize in several principles. 
3 Principles of Autonomy

Consider two agents $A_1$ and $A_2$, who are exactly the same, except that $A_2$
possesses a few extra means useful in discovery, that apply, for instance, in data
acquisition or in inductive generalization. $A_2$ is more autonomous than $A_1$, as
those extra means increase $A_2$’s discovery capabilities. They are unavailable to
$A_1$, unless provided by other agents.

This simple analysis would be a poor philosophy when applied to humans, 
because our mental techniques are so difficult to separate, but it makes good 
sense when agents are computer systems that can be easily cloned and compared. 
We can draw our first principle: 

A1: Autonomy of an agent is increased by each new method that
overcomes some of the agent’s limitations.

Admittedly, each machine discoverer is only autonomous to a degree, but 
its autonomy can increase by future research. This principle leads to a program: 
identify missing functions of discovery systems and develop methods that supply 
those functions. 

The mere accumulation of new components, however, is not very effective. 
Each new component may have been designed to provide a particular missing 
function, but after it is completed, it may be used in new creative ways, in 
combination with the existing methods. Such combinations multiply the overall 
discovery capability. As a result of integration, more discovery steps in succession 
can be performed without external help, leading to greater autonomy: 

A2: Autonomy of an agent is increased by method integration, 
when new combinations of methods are introduced. 

The BACON program (Langley, Simon, Bradshaw & Zytkow, 1987) was the
first to follow these two principles. It started from BACON.1 which is a
heuristic equation finder, followed by BACON.3 which generates data and
recursively applies BACON.1, then by BACON.4 which converts nominal variables to
numerical, and uses both BACON.3 and BACON.1 to generate numerical laws.
Finally, BACON.5 augments earlier BACONs with the reasoning based on symmetry
and conservation. Similar programs led to the increased discovery capabilities
of FAHRENHEIT (Zytkow, 1996), IDS (Nordhausen & Langley, 1993), BR
(Kocabas, 1991; Kocabas & Langley, 1995), and MECHEM (Valdes-Perez, 1993,
1994).

Many methods use data to generate knowledge. When applied in sequence, 
elements of knowledge generated at the previous step become data for the next 
step. This perspective on knowledge as data for the next step towards an im- 
proved knowledge is important for integration of many methods: 

A3: Each piece of discovered knowledge can be used as data for 
another step towards discovery: 

        Step-1                          Step-2
Data-1 --------> Knowledge-1 = Data-2 --------> Knowledge-2 = Data-3
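
The chain in A3 can be illustrated with a minimal sketch. It is loosely in the spirit of applying an equation finder recursively, but it is not the code of any system cited here; the names `fit_slope`, `step_1`, `step_2`, `run_steps` and the measurements are hypothetical. Slopes found in the first step (Knowledge-1) are reused unchanged as the data of the second step (Data-2).

```python
from typing import Any, Callable, Dict, List, Tuple

Points = List[Tuple[float, float]]


def fit_slope(points: Points) -> float:
    """Least-squares slope a of the proportional law y = a * x."""
    return sum(x * y for x, y in points) / sum(x * x for x, _ in points)


def step_1(data_1: Dict[float, Points]) -> Dict[float, float]:
    """Data-1 -> Knowledge-1: one slope per setting of the control variable z."""
    return {z: fit_slope(points) for z, points in data_1.items()}


def step_2(data_2: Dict[float, float]) -> float:
    """Data-2 -> Knowledge-2: the slopes are now data for the law a = b * z."""
    return fit_slope(list(data_2.items()))


def run_steps(data: Any, steps: List[Callable[[Any], Any]]) -> List[Any]:
    """Apply the steps in order and keep every intermediate result."""
    trace = [data]
    for step in steps:
        trace.append(step(trace[-1]))
    return trace


# Hypothetical measurements of y against x, taken at two settings of z.
data_1 = {1.0: [(1.0, 2.0), (2.0, 4.0)], 2.0: [(1.0, 4.0), (2.0, 8.0)]}
trace = run_steps(data_1, [step_1, step_2])
knowledge_1, knowledge_2 = trace[1], trace[2]
```

Here `knowledge_1` holds one slope per value of z, and `knowledge_2` summarizes how that slope depends on z, so the two steps together yield a law in three variables.
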
A single step rarely permits a decisive evaluation of results. A combination of
steps provides a more informed evaluation, that is, extra reasons for acceptance of
an alternative generated at step 1. For instance, several equations of comparable
simplicity can often fit the same data within acceptable accuracy. Further steps
can help in making choices among the competing equations. Some equations may
provide a better generalization to a new variable or lead to a broader scope as
a result of boundary search (FAHRENHEIT; Zytkow, 1996):

A4: Autonomous choice is improved by evaluation that is back- 
propagated to results reached at earlier steps. 

Principles A1-A4 guide some scientific discovery systems, but are still a
remote ideal in KDD, which is dominated by simple algorithms and human guidance.



4 Theory of Knowledge 

Knowledge of the external world goes beyond data, even if data are the primary
source of knowledge. It is important to see beyond formal patterns and
understand elements of the formalism in relation to elements of the external world.
Consider a fairly broad representation of a regularity (law, generalization):
Pattern (relationship) P holds in the range R of situations.

In practical applications this schema can be narrowed down in many ways, 
for instance: 

(1) if $P_1(A_1) \,\&\, \ldots \,\&\, P_k(A_k)$ then $Rel(A, B)$

where $A, B, A_1, \ldots, A_k$ are attributes that describe each object in a class
of objects, while $P_1, \ldots, P_k$ are predicates, such as $A_1 > 0$ or $A_2 = a$.
An even simpler schema:

(2) if $P_1(A_1) \,\&\, \ldots \,\&\, P_k(A_k)$ then $C = c$

covers all rules sought as concept definitions in machine learning. 

A good fit to data is important, but a discoverer should know the objects described
in data and understand the predicates and constants that occur in a generalization.
Only then can knowledge be applied to other similar situations.

K1: Seek objective knowledge about the real world, not knowledge
about data.

This principle contrasts with a common KDD practice, when researchers 
focus entirely on data. Sometimes specific knowledge about data is important: 
which data are wrong, what data encoding schemas were used. 

An example of disregard for objective knowledge is a newly popular mechanism
for concept learning from examples, called bagging. Consider a system
that produces decision trees. When provided with different samples of data it
will typically generate different trees. Rather than analyzing commonalities in
those trees in order to capture objective relations in data, all the trees are used
jointly to predict values of the class attribute. If you wait a few minutes you will
get another dozen trees. Many parameter values in such trees are incidental,
but the method has no concern for objective knowledge.
Schemas such as (1) or (2) define vast, sometimes infinite, hypothesis spaces.
Hypotheses are generated, often piece by piece, evaluated and retained or elim- 
inated. 

K2: [Principle of knowledge construction] All elements of each piece
of knowledge are constructed and evaluated by a discovery system.

Schemas such as (2) define hypothesis spaces used by many systems. According
to K2 the same construction steps must be made by each system, when they
operate in the same hypothesis space. If the construction mechanisms were made
clear, it would be easy to compare them.

Likewise, different systems use the same or very similar evaluation measures: 

— accuracy: how close are predictions to actual data;

— support: what percentage of cases is covered by RANGE and by PATTERN.

Other measures are concerned with the probabilistic component of knowledge:

— predictive strength: degree of determinism of predictions;

— significance: how probable it is that the data could have been generated from a
given statistical distribution.
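
As a rough illustration of the first two measures for a rule of form (2), the sketch below counts how many records fall in the range of the rule and how many of those also satisfy its pattern. The function `rule_measures`, its three output names, and the records are our assumptions; other definitions of support and accuracy are in use.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


def rule_measures(
    records: List[Record],
    in_range: Callable[[Record], bool],
    pattern: Callable[[Record], bool],
) -> Dict[str, float]:
    """Support and accuracy of the rule 'if in_range then pattern'."""
    total = len(records)
    covered = [r for r in records if in_range(r)]
    confirmed = [r for r in covered if pattern(r)]
    return {
        # fraction of all cases that fall in the RANGE of the rule
        "range_support": len(covered) / total if total else 0.0,
        # fraction of all cases that satisfy both RANGE and PATTERN
        "pattern_support": len(confirmed) / total if total else 0.0,
        # fraction of covered cases for which the prediction is right
        "accuracy": len(confirmed) / len(covered) if covered else 0.0,
    }


# Hypothetical records; the rule under test is: if A1 > 0 then C = "c".
records = [
    {"A1": 1, "C": "c"},
    {"A1": 2, "C": "c"},
    {"A1": 3, "C": "d"},
    {"A1": -1, "C": "d"},
    {"A1": -2, "C": "c"},
]
measures = rule_measures(
    records, lambda r: r["A1"] > 0, lambda r: r["C"] == "c"
)
```

Stating the measures as one shared function makes it easier to compare systems that search the same hypothesis space.
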

Predictions are essential for hypothesis evaluation. It is doubtful that we
would consider a particular statement a piece of knowledge about the external
world if it did not enable empirically verifiable predictions.

K3: A common characteristic of knowledge is its empirical contents,
that is such conclusions which are empirically verifiable predictions.

Knowledge improvement can be measured by the increased empirical con- 
tents. Logical inference is used in order to draw empirically verifiable conclu- 
sions. The premises are typically general statements of laws and some known 
facts, while conclusions are statements which predict new facts. 

When we examine carefully the results of clustering, the way they are
typically expressed, we do not see empirical contents. Exhaustive partitioning of
data space leaves no room for empirical contents.

Empirical contents can occur in regularities (laws, statements, sentences),
not in predicates which may or may not be satisfied. Concepts, understood as
predicates, have no empirical contents.

K4: Each concept is an investment; it can be justified by the regularities
it allows us to express.

We can define huge numbers of concepts, but such activity does not provide 
knowledge. The whole universe of knowledge goes beyond concept definitions. 

Many knowledge pieces can be expressed in a simple form of classification 
rules, association rules, contingency tables, equations, neural networks, logical 
statements and decision trees. But when the process of discovery continues, 
producing large numbers of such pieces, their management becomes a major 
problem. Some discovery systems organize large numbers of simple pieces into a 
global graph representation. Links in a graph are used to represent relationships
between pieces of knowledge, while frame-like structures represent knowledge 
contained in individual nodes in the graphs. 

K5: Knowledge integration into graphs provides fast, directed access
to individual pieces of knowledge.

Systems such as DIDO (Scott and Markovitch, 1993), FAHRENHEIT, IDS, and
LIVE (Shen, 1993) use tools for constructing, maintaining, and analyzing the
network of knowledge emerging in the discovery process. FAHRENHEIT’s
knowledge graph (Zytkow, 1996) allows the system to examine any given state of
knowledge and seek new goals that represent limitations of knowledge in the
graph.

For instance, when an equation E has been found for a sequence of data,
new alternative goals are to find the limits of E’s application or to generalize E
to another control variable. Generalization, in turn, can be done by recursively
invoking the goals of data collection and equation fitting (BACON.3: Langley
et al., 1987; and FAHRENHEIT).



5 Principles of Search 

Discoverers explore the unknown. They examine many possibilities which can 
be seen as dead ends from the perspective of the eventually accepted solutions, 
because they do not become components of the accepted solutions. This process 
is called search. We can conclude that: 

S1: If you do not search, you do not discover.



5.1 Search Spaces 

The search space, also called problem space or state space, has been introduced 
in artificial intelligence as a conceptual tool to enable a theoretical treatment of 
the search process (Simon, 1979). 

A search space is defined by a set of states $S$ and a two-argument relation
$M \subseteq S \times S$, called a move relation. $M$ contains all direct
state-to-state transitions. The aims of search are represented by the evaluation
function $E : S \to \mathcal{R} \times \ldots \times \mathcal{R}$.

In practice, the states are not given in advance, because search spaces are 
very large, often infinite. States are constructed by search operators from the 
existing states. Operators are algorithms which implement the relation M , by 
acts of construction of new states from the existing states, that is from states 
which have been earlier constructed. 

In applications to discovery, states have the meaning of tentative knowledge
pieces or hypotheses. They are implemented as instances of data structures that
represent different possible pieces of knowledge. The move relation represents
incremental construction of knowledge elements. The evaluation function provides
several metrics that apply to hypotheses. In summary:

S2: Make the search design simple and explicit. Define data structures
that represent tentative pieces of knowledge and operations on
those pieces. Define metrics on pieces of knowledge.

Search design must be flexible enough so that it can be adjusted to a variety 
of problems, data and computational resources. Yet it must be simple enough so 
that search properties are understood, problems fixed and the scope of search 
modified when needed. 
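
A minimal sketch of such an explicit design follows, under our own assumptions: a state is a pair of integer exponents standing for the tentative term x**p * y**q, the operator `moves` changes one exponent by one, and `evaluate` returns two metrics. The data are hypothetical and assumed positive; none of this reproduces a particular system.

```python
from typing import List, Tuple

State = Tuple[int, int]          # exponents (p, q) of the term x**p * y**q
Data = List[Tuple[float, float]]  # observed (x, y) pairs, assumed positive


def moves(state: State) -> List[State]:
    """Operator implementing the move relation M: change one exponent by 1."""
    p, q = state
    candidates = [(p + 1, q), (p - 1, q), (p, q + 1), (p, q - 1)]
    # The empty term (0, 0) is trivially constant, so it is never constructed.
    return [s for s in candidates if s != (0, 0)]


def evaluate(state: State, data: Data) -> Tuple[float, int]:
    """Evaluation function E: (relative spread of the term, complexity)."""
    p, q = state
    values = [x ** p * y ** q for x, y in data]
    mean = sum(values) / len(values)
    spread = (max(values) - min(values)) / abs(mean)
    return spread, abs(p) + abs(q)


# Hypothetical data obeying x * y = 6.
data = [(1.0, 6.0), (2.0, 3.0), (3.0, 2.0)]
start = (1, 0)                                   # the term x
successors = moves(start)
best = min(successors, key=lambda s: evaluate(s, data))
```

With states, operators and metrics written out separately, each can be inspected or replaced without touching the others; here the successor with the smallest spread is the invariant term x * y.
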

5.2 Discovery as Problem Solving Search 

A simple search problem can be defined by a set of initial states and a set of goal 
states. The task is to find a trajectory from an initial state to a goal state. In the
domain of discovery the goal states are not known in advance. They are typically 
defined by threshold values of tests such as accuracy and statistical significance. 
Goal states exceed those thresholds. Without reaching the threshold, even the 
best state reached in the discovery process can be insufficient. 

S3: [Herbert Simon 1] Discovery is problem solving. Each problem
is defined by the initial state of knowledge, including data and by 
the goals. Solutions are generated by search mechanisms aimed at the 
goals. 

The initial state can be a set of data, while a goal state may be an equation 
that fits those data (BACON.1 and other equation finders). The search proceeds
by construction of terms, by their combinations into equations, by generation of 
numerical parameters in equations and by evaluation of completed equations. 

While new goals can be defined by pieces of knowledge missing in a knowledge
graph, plans to accomplish those goals are different search mechanisms. The same
goal can be carried out by various plans. For instance, a variety of equation finders
represent different plans for reaching the same or a very similar goal: BACON.1,
COPER (Kokar, 1986), FAHRENHEIT, IDS (Nordhausen and Langley, 1993),
KEPLER (Wu and Wang, 1989), and the system of Dzeroski and Todorovski (1993).

Goals and plans can be called recursively, until plans are reached which 
can be carried out directly, without reference to other goals and plans. Some 
equation finding systems use complex goals decomposed into similar subgoals, 
and repeatedly apply the same search mechanisms to different problems. They 
design many experiments, collect sequences of data and search for equations that 
fit those data (BACON.3; FAHRENHEIT; SDS: Washio & Motoda, 1997).

Search spaces should be sufficiently large to provide solutions for many
problems. But simply enlarging the search space does not make an agent more
creative. It is easy to implement a program that enumerates all strings of characters.
If enough time were available, it would produce all books, all data structures, all
computer programs. But it produces a negligible proportion of valuable results
and it cannot tell which results are valuable.

S4: [Herbert Simon 2] A heuristic and data-driven search is an efficient
and effective discovery tool. Data are transformed into plausible
pieces of solutions. Partial solutions are evaluated and used to guide
the search.

Still, the search may fail or take too much time and a discoverer should be 
able to change the goal and continue. 

S5: [Recovery from failure] Each discovery step may fail and cognitive
autonomy requires methods that recognize failure and decide
on the next goal.

For instance, if an equation which would fit the data cannot be found, those 
data can be decomposed into smaller fragments and the equation finding goal 
can be set for each fragment separately. If no regularity can be found, eventually 
data can be treated as a lookup table. 
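
The recovery strategy just described can be sketched as follows. This is only an illustration under our assumptions: the equation finder is reduced to a least-squares line fit, failure is recognized by a residual tolerance, fragments are obtained by halving, and the names `fit_line` and `discover` are hypothetical.

```python
from typing import Any, List, Optional, Tuple

Point = Tuple[float, float]


def fit_line(points: List[Point]) -> Optional[Tuple[float, float]]:
    """Least-squares line y = a * x + b, or None if x does not vary."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        return None
    a = sum((x - mean_x) * (y - mean_y) for x, y in points) / sxx
    return a, mean_y - a * mean_x


def discover(
    points: List[Point], tolerance: float = 1e-6, min_points: int = 3
) -> List[Tuple[str, Any]]:
    """Try one equation; on failure set the same goal for each fragment."""
    if len(points) >= min_points:
        line = fit_line(points)
        if line is not None:
            a, b = line
            if max(abs(y - (a * x + b)) for x, y in points) <= tolerance:
                return [("line", (a, b))]
        # Failure recognized: decompose the data and retry on each half.
        middle = len(points) // 2
        return (discover(points[:middle], tolerance, min_points)
                + discover(points[middle:], tolerance, min_points))
    # Too few points for a meaningful fit: keep the data as a lookup table.
    return [("lookup", dict(points))]


# Hypothetical data: y = 2x on the first half, y = 20 - x on the second.
points = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0),
          (4.0, 16.0), (5.0, 15.0), (6.0, 14.0), (7.0, 13.0)]
result = discover(points)
```

Halving is the crudest possible decomposition; a system that detects boundaries of regularities would place the split where the residuals change character.
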

5.3 Search Control 

Search states can be generated in different orders. Search control, which handles 
the search at run-time, is an important discovery tool. Some of the principles 
that guide the design of search control are very broad and well known: 

— decompose the goal into independent subgoals;

— evaluate partial results as early as possible;

— search is directed by the evaluation function; do not expect that the results
satisfy a non-applied metric.

We can propose several principles specific to discovery. 

S6: [Simple-first] Order hypotheses by simplicity layers; try simpler
hypotheses before more complex. 

The implementation is easy, since simpler hypotheses are constructed before 
more complex. Also, simpler hypotheses are usually more general, so they are 
tried before more complex, that is more specific hypotheses. Thus if a simple 
hypothesis is sufficient, there is no need to make it more complex. 

S7: [Dendral] Make search non-redundant and exhaustive within
each simplicity layer. 

Do not create the same hypothesis twice, but do not miss any. The
implementation may not be easy, but the principle is important. If there are many
alternative solutions of comparable simplicity, what are the reasons to claim that
one of them is true? Each solution is questionable, because they make mutually
inconsistent claims. Typically, alternative solutions indicate the need for more
data or further evaluation.

In set-theoretic terms, S and M together form a directed graph. But
practically, states are constructed from other states rather than retrieved from memory.
Therefore a search tree rather than a graph represents the history of search. An
isomorphic hypothesis can be generated at different branches of the search tree.
If that happens, typically the same hypothesis is generated a very large number
of times. Each time the hypothesis can be expanded to a potentially large search
subtree, leading to massive redundant search. Testing for isomorphism is
complex, so that it may not be useful. Isomorph-free graph generation is preferred,
when it can be arranged (Dendral; GELLMANN, Zytkow 1992).
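
For a simple hypothesis space the simple-first and non-redundancy principles can be combined in a few lines. The sketch below is our illustration, not the generator of any cited system: a hypothesis is a conjunction of named conditions, a layer holds all conjunctions of the same size, and `itertools.combinations` yields each conjunction exactly once regardless of the order of its conditions. The acceptance test and the records are hypothetical.

```python
from itertools import combinations
from typing import Any, Callable, Dict, List, Optional, Tuple

Record = Dict[str, Any]
Test = Callable[[Record], bool]


def simple_first_search(
    records: List[Record],
    conditions: Dict[str, Test],
    target: Test,
    min_cover: int = 2,
) -> Optional[Tuple[int, List[Tuple[str, ...]]]]:
    """Return the first simplicity layer that contains accepted rules."""
    names = sorted(conditions)
    for size in range(1, len(names) + 1):        # simpler layers first
        accepted = []
        for combo in combinations(names, size):  # each conjunction once
            covered = [r for r in records
                       if all(conditions[name](r) for name in combo)]
            if len(covered) >= min_cover and all(target(r) for r in covered):
                accepted.append(combo)
        if accepted:
            # All alternatives of equal simplicity are reported together.
            return size, accepted
    return None


# Hypothetical records with three binary attributes and a class C.
records = [
    {"a": 1, "b": 1, "c": 1, "C": 1},
    {"a": 1, "b": 1, "c": 0, "C": 1},
    {"a": 1, "b": 0, "c": 1, "C": 0},
    {"a": 0, "b": 1, "c": 0, "C": 0},
    {"a": 0, "b": 0, "c": 1, "C": 0},
]
conditions = {
    "a=1": lambda r: r["a"] == 1,
    "b=1": lambda r: r["b"] == 1,
    "c=1": lambda r: r["c"] == 1,
}
found = simple_first_search(records, conditions, lambda r: r["C"] == 1)
```

Because the whole layer is examined before the search stops, the caller sees every competing rule of the same simplicity and can decide whether more data are needed to choose among them.
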

6 Beyond Simple-Minded Tools 

While equations are often an adequate generalization tool for scientific data, in 
KDD applications the situation is more complex. Knowledge in many forms can 
be derived from data and it is not known which is the best form of knowledge 
for a given data set. The vast majority of data mining is performed with the 
use of single-minded tools. Those tools miss discovery opportunities if results 
do not belong to a particular hypothesis space. Further, they rarely consider 
the question whether the best fit hypothesis is good enough to be accepted and 
whether other forms of knowledge are more suitable for a given case. To improve 
the dominant practice, we should use the following principle: 

O1: [Open-mindedness] Knowledge should be discovered in the form
that reflects the real-world relationships, not one or another tool at
hand.

It may be unnecessarily complex, however, to search for many forms of knowl- 
edge and then retain those best justified by data. The methodology applied in 
49er (Zembowicz & Zytkow, 1996) uses simple forms of knowledge to guide search 
for more complex and specialized forms: 

1. Use a single method to start data mining. Search for contingency tables. 

2. Detect different forms of regularities in contingency tables by specialized 
tests. 

3. Search for specialized knowledge in separate hypothesis spaces. 

4. Combine many regularities of one type into specialized theories. For instance, 
create taxonomies, inclusion graphs, systems of equations. 

Different forms of knowledge require search in different hypothesis spaces. 
But if data do not fit any hypothesis in a given space, much time can be saved 
if that space is not searched at all. This problem can be solved in step 2 above, 
by demonstrating non-existence of solutions in particular spaces. 



7 Statistics 

While many methods of statistics are still unknown to the discovery community, 
many textbook solutions have been implemented. Statistics adds a probabilis- 
tic component to deterministic regularities handled by the languages of logic, 
algebra, and the like. 

So-called random variables represent probabilistic distributions of
measurements, which are due to error and to variability within a population.

Equations and other forms of deterministic knowledge can be augmented with
statistical distributions, for instance, $y = f(x) + N(0, \sigma(x))$.
$N(0, \sigma(x))$ represents a Gaussian distribution of error, with mean value
equal to zero and standard deviation $\sigma(x)$.

Statistics offers methods which estimate parameter values of a distribution 
(population) from a sample, analyze bias of estimators, derive sampling distribu- 
tions, then significance tests and confidence intervals for estimated parameters. 

Most often a particular distribution is assumed rather than derived from 
data, because traditional statistical data mining operated on small samples and 
used visualization tools to stimulate human judgement. Currently, when large 
datasets are abundant and more data can be easily generated in automated 
experiments, we can argue for verification of assumptions: 

STAT1: Do not make assumptions and do not leave unverified
assumptions.

For instance, when using the model $y = f(x) + N(0, \sigma(x))$, verify the Gaussian
distribution of residuals, with the use of the runs test and other tests of normality.
Publications in statistics notoriously start from “Let us assume that ...” Either
use data to verify the assumptions or, when this is not possible, ask what is
the risk or cost when the assumptions are not met.
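
One such check can be sketched with the standard Wald-Wolfowitz runs test applied to the signs of the residuals. Note the limits of this sketch: the test detects non-random ordering of signs, which signals a misfit of f, but it does not by itself establish normality, so it complements rather than replaces normality tests. The function name `runs_test_z` and the example residuals are hypothetical.

```python
import math
from typing import List


def runs_test_z(residuals: List[float]) -> float:
    """Normal-approximation z statistic of the runs test on residual signs."""
    signs = [r > 0 for r in residuals if r != 0]
    n_pos = sum(signs)
    n_neg = len(signs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("residuals must contain both signs")
    n = n_pos + n_neg
    # A run is a maximal block of consecutive residuals with the same sign.
    runs = 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    expected = 2 * n_pos * n_neg / n + 1
    variance = 2 * n_pos * n_neg * (2 * n_pos * n_neg - n) / (n * n * (n - 1))
    if variance == 0:
        raise ValueError("too few residuals for the test")
    return (runs - expected) / math.sqrt(variance)


# Hypothetical residuals: a systematic misfit gives long blocks of one sign,
# while an oscillating misfit gives too many sign changes.
z_blocks = runs_test_z([1.0] * 10 + [-1.0] * 10)
z_alternating = runs_test_z([1.0, -1.0] * 10)
```

A large negative z (too few runs) suggests that f misses a trend in the data; values of |z| below about 2 are compatible with randomly ordered signs.
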

Another area which requires revision of traditional statistical thinking is
testing hypothesis significance. Statistics asks how many real regularities we
are willing to disregard (error of omission) and how many spurious regularities
we are willing to accept (error of admission). In a given dataset, weak
regularities cannot be distinguished from patterns that come from a random
distribution. Consider search in random data: when 100,000 hypotheses are
examined, about 1000 of them can be expected to pass when the significance
threshold is the standard 1%. To minimize the number of such pseudo-regularities,
we should set a demanding threshold of acceptance: 0.01% = 0.0001 would pass
about 10 spurious regularities. However, by choosing a demanding acceptance
threshold we risk ignoring regularities which are real but weak. The problem
arises in automated discovery because discovery systems search massive
hypothesis spaces with the use of statistical tests
which occasionally mistake a random fluctuation for a genuine regularity.
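
As a worked check of these numbers, assume (our simplification) that each of $N$ hypotheses tested on random data passes the test with probability equal to the threshold $\alpha$. The expected number of spurious acceptances is then

\[
E[\text{spurious}] = N\alpha, \qquad
10^{5} \times 10^{-2} = 10^{3}, \qquad
10^{5} \times 10^{-4} = 10 .
\]

The expectation does not require the tests to be independent; dependence among tests affects only the spread around this value.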

STAT2: [Significance 1] Choose a significance threshold that enables
middle ground between spurious regularities and weak but real
regularities specific to a given hypothesis space.

While a significance threshold should admit a small percent of spurious reg- 
ularities, it is sometimes difficult to compute the right threshold for a given 
search. Statistical significance thresholds depend on the number of independent 
hypotheses and independent tests. When those numbers are difficult to estimate, 
experiments on random data can be helpful. We know that those data contain 
no regularities, so all detected regularities are spurious and should be rejected 
by the test of significance. 

STAT3: [Significance 2] Use random data to determine the right
values of significance thresholds for a given search mechanism.
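
One way to apply this principle is sketched below under stated assumptions. The search mechanism is a stand-in of our own: it examines a fixed number of random attributes and keeps the largest absolute Pearson correlation with a target. The threshold is calibrated on the test statistic rather than on a p-value, and all names and parameter values are hypothetical.

```python
import math
import random
from typing import List


def pearson_r(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def best_score_on_random_data(
    rng: random.Random, n_cases: int, n_hypotheses: int
) -> float:
    """Run the stand-in search on pure noise; return its best |r|."""
    target = [rng.gauss(0, 1) for _ in range(n_cases)]
    return max(
        abs(pearson_r([rng.gauss(0, 1) for _ in range(n_cases)], target))
        for _ in range(n_hypotheses)
    )


def calibrate_threshold(
    rng: random.Random, n_cases: int, n_hypotheses: int,
    n_trials: int, false_alarm_rate: float,
) -> float:
    """Score that the whole search exceeds on random data only rarely."""
    scores = sorted(
        best_score_on_random_data(rng, n_cases, n_hypotheses)
        for _ in range(n_trials)
    )
    index = min(n_trials - 1, math.ceil((1 - false_alarm_rate) * n_trials))
    return scores[index]


threshold = calibrate_threshold(random.Random(0), 30, 20, 200, 0.05)

# Check on fresh random data: how often does any hypothesis pass?
check_rng = random.Random(1)
false_alarms = sum(
    best_score_on_random_data(check_rng, 30, 20) > threshold
    for _ in range(200)
) / 200
```

Because the threshold is derived from the best score of a complete search run, it accounts for the number of hypotheses examined without requiring an estimate of how many of them are independent.
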
The significance dilemma for a given regularity can be solved by acquisition
of additional data. 

8 Conclusions 

In this paper we focused on selected principles that guide research in machine 
discovery. They are related to areas such as artificial intelligence, logic and phi- 
losophy of science, and statistics, but they are specific to machine discovery. 
We discussed a few ways in which these principles interact and some of the 
ways in which they affect research on discovery, in particular discovery systems 
construction. 
Abstract. The paper is a brief summary of an invited talk given at the
Discovery Science conference. The principal points are as follows: first,
that probability theory forms the basis for connecting hypotheses and
data; second, that the expressive power of the probability models used
in scientific theory formation has expanded significantly; and finally, that
still further expansion is required to tackle many problems of interest.
This further expansion should combine probability theory with the expressive
power of first-order logical languages. The paper sketches an
approximate inference method for representation systems of this kind.



1 Data, Hypotheses, and Probability 

Classical philosophers of science have proposed a “deductive-nomological” ap- 
proach in which observations are the logical consequence of hypotheses that 
explain them. In practice, of course, observations are subject to noise. In the 
simplest case, one attributes the noise to random perturbations within the mea- 
suring process. For example, the standard least-squares procedure for fitting lin- 
ear models to data effectively assumes constant-variance Gaussian noise applied 
to each datum independently. 

In more complex situations, uncertainty enters the hypotheses themselves. 
In Mendelian genetics, for instance, characters are inherited through a random 
process; experimental data quantify, but do not eliminate, the uncertainty in 
the process. Probability theory provides the mathematical basis relating data to 
hypotheses when uncertainty is present. Given a set of data D, a set of possible
hypotheses $\mathcal{H}$, and a question X, the predicted answer according to the full
Bayesian approach is

$$P(X \mid D) = \alpha \sum_{H \in \mathcal{H}} P(X \mid H)\, P(D \mid H)\, P(H)$$

where $\alpha$ is a normalizing constant, $P(X \mid H)$ is the prediction of each hypothesis
H, $P(D \mid H)$ is the likelihood of the data given the hypothesis H and therefore
incorporates the measurement process, and $P(H)$ is the prior probability of H.
Because $\mathcal{H}$ may be large, it is not always possible to calculate exact Bayesian
predictions. Recently, Markov chain Monte Carlo (MCMC) methods have shown
great promise for approximate Bayesian calculations, and will be discussed
further below. In other cases, a single maximum a posteriori (MAP) hypothesis
can be selected by maximizing $P(D \mid H)\,P(H)$.
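For a small, finite hypothesis set, the full Bayesian prediction and the MAP choice can be computed directly. The following Python sketch is only an illustration: the coin-bias hypotheses, the uniform prior, and the binomial likelihood are assumptions made up for the example, and the function names are hypothetical.

```python
from math import comb

def posterior(hypotheses, prior, likelihood, data):
    """P(H | D) for each H, normalized over a finite hypothesis set."""
    unnorm = {h: likelihood(data, h) * prior[h] for h in hypotheses}
    z = sum(unnorm.values())  # z equals 1 / alpha, the normalizing constant
    return {h: u / z for h, u in unnorm.items()}

def predict(hypotheses, prior, likelihood, data, prediction):
    """Full Bayesian prediction: sum over H of P(X | H) P(H | D)."""
    post = posterior(hypotheses, prior, likelihood, data)
    return sum(prediction(h) * post[h] for h in hypotheses)

def map_hypothesis(hypotheses, prior, likelihood, data):
    """Single hypothesis maximizing P(D | H) P(H)."""
    return max(hypotheses, key=lambda h: likelihood(data, h) * prior[h])

# Hypothetical example: H is a coin's heads probability, D = (heads, tosses).
biases = [0.25, 0.5, 0.75]
uniform = {h: 1 / 3 for h in biases}

def binomial(d, h):
    heads, tosses = d
    return comb(tosses, heads) * h ** heads * (1 - h) ** (tosses - heads)

p_next_heads = predict(biases, uniform, binomial, (7, 10), prediction=lambda h: h)
best = map_hypothesis(biases, uniform, binomial, (7, 10))
```

The full prediction averages over all three biases, whereas the MAP shortcut commits to one of them; the two differ most when the posterior is spread over several hypotheses.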
2 Expressive Power

Probability models (hypotheses) typically specify probability distributions over
“events.” Roughly speaking, a model family $\mathcal{L}_2$ is at least as expressive as $\mathcal{L}_1$ iff,
for every $e_1$ in $\mathcal{L}_1$, there is some $e_2$ in $\mathcal{L}_2$ such that $e_1$ and $e_2$ denote “the same”
distribution. Crude expressive power in this sense is obviously important, but
more important is efficient representation: we prefer desirable distributions to
be represented by subfamilies with “fewer parameters,” enabling faster learning.

The simplest probability model is the “atomic” distribution that specifies 
a probability explicitly for each event. Bayesian networks, generalized linear 
models, and, to some extent, neural networks provide compact descriptions of 
multivariate distributions. These models are all analogues of propositional logic
representations.

Additional expressive power is needed to represent temporal processes: hidden
Markov models (HMMs) are atomic temporal models, whereas Kalman filters
and dynamic Bayesian networks (DBNs) are multivariate temporal models.
The efficiency advantage of multivariate over atomic models is clearly seen with
DBNs and HMMs: while the two families are equally expressive, a DBN representing
a sparse process with n bits of state information requires $O(n)$ parameters
whereas the equivalent HMM requires $O(2^n)$. This appears to result in greater
statistical efficiency and hence better performance for DBNs in speech recognition.
DBNs also seem promising for biological modelling in areas such as
oncogenesis and genetic networks.
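The gap between the two growth rates can be made concrete with a small Python sketch. It rests on assumptions that are not stated in the text: every state variable is binary, and "sparse" means that each variable has at most `max_parents` parents in the previous time slice. The HMM figure is the number of atomic states, which is a lower bound on the number of parameters the HMM needs.

```python
def dbn_parameters(n_bits, max_parents):
    """One conditional probability per parent configuration of each state bit.

    Assumes binary variables and a fixed bound on the number of parents,
    so the total grows linearly in n_bits.
    """
    return n_bits * 2 ** max_parents

def hmm_states(n_bits):
    """An atomic model needs one state per joint assignment of the bits."""
    return 2 ** n_bits

# Hypothetical sizes: 20 bits of state, at most 3 parents per bit.
dbn_size = dbn_parameters(20, 3)
hmm_size = hmm_states(20)
```

With these illustrative numbers the DBN needs 160 conditional probabilities while the HMM already has more than a million states.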

In areas where still more complex models are required, there has been less 
progress in creating general-purpose representation languages. Estimating the 
amount of dark matter in the galaxy by counting rare MACHO observations
required a detailed statistical model of matter distribution, observation regime,
and instrument efficiency; no general tools are available that can handle this kind
of aggregate modelling. Research on phylogenetic trees and genetic linkage
analysis uses models with repeating parameter patterns and different tree
structures for each data set; again, only special-purpose algorithms are used.
What is missing from standard tools is the ability to handle distributions over at- 
tributes and relations of multiple objects — the province of first-order languages. 

3 First-Order Languages 

First-order probabilistic logic (FOPL) knowledge bases specify probability distributions
over events that are models of the first-order language. The ability
to handle objects and relations gives such languages a huge advantage in representational
efficiency in certain situations: for example, the rules of chess can be
written in about one page of Prolog but require perhaps millions of pages in a
propositional language. Recent work by Koller and Pfeffer has succeeded in
devising restricted sublanguages that make it relatively easy to specify complete
distributions over events. Current inference methods, however, are
somewhat restricted in their ability to handle uncertainty in the relational structure
of the domain or in the number of objects it contains. It is possible that
algorithms based on MCMC will yield more flexibility. The basic idea is to construct
a Markov chain on the set of logical models of the FOPL theory. In each
model, any query can be answered simply; the MCMC method samples models
so that estimates of the query probability converge to the true posterior. For
example, suppose we have the following simple knowledge base:

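$$\forall x, y \quad P(\mathit{CoveredBy}(x, y)) = 0.3$$

$$\forall x \quad P(\mathit{Safe}(x) \mid \exists y\, \mathit{CoveredBy}(x, y)) = 0.8$$

$$\forall x \quad P(\mathit{Safe}(x) \mid \neg \exists y\, \mathit{CoveredBy}(x, y)) = 0.5$$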

and we want to know the probability that everyone who is safe is covered only
by safe people, given certain data on objects a, b, c, d:

$$P\bigl(\forall x\; \mathit{Safe}(x) \Rightarrow (\forall y\; \mathit{CoveredBy}(x, y) \Rightarrow \mathit{Safe}(y)) \;\bigm|\; \mathit{Safe}(a), \mathit{Safe}(b), \neg \mathit{Safe}(d)\bigr)$$

Figure 1 shows the results of running MCMC for this query.
Fig. 1. Solution of a first-order query by MCMC, showing the average probability
over 10 runs as a function of sample size (solid line)
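To make the idea of a Markov chain over logical models concrete, the following Python sketch runs a simple Metropolis sampler for the knowledge base and query given above. It is not the sampler behind Fig. 1. It assumes a closed domain containing exactly the four objects a, b, c, d, treats every ground atom as a Boolean variable, and proposes moves by flipping one non-evidence atom at a time; these choices and all function names are assumptions of the sketch.

```python
import random

OBJECTS = ["a", "b", "c", "d"]
EVIDENCE = {"a": True, "b": True, "d": False}  # Safe(a), Safe(b), not Safe(d)

def joint(cov, safe):
    """Probability of one logical model under the three rules of the knowledge base."""
    p = 1.0
    for x in OBJECTS:
        for y in OBJECTS:
            p *= 0.3 if cov[(x, y)] else 0.7
        covered = any(cov[(x, y)] for y in OBJECTS)
        p_safe = 0.8 if covered else 0.5
        p *= p_safe if safe[x] else 1.0 - p_safe
    return p

def query(cov, safe):
    """Everyone who is safe is covered only by safe objects."""
    return all(
        (not safe[x]) or all((not cov[(x, y)]) or safe[y] for y in OBJECTS)
        for x in OBJECTS
    )

def mcmc(n_samples, seed=0):
    """Estimate the query probability given the evidence by Metropolis sampling."""
    rng = random.Random(seed)
    cov = {(x, y): False for x in OBJECTS for y in OBJECTS}
    safe = {x: EVIDENCE.get(x, True) for x in OBJECTS}
    # Only atoms not fixed by the evidence may change.
    free = [("cov", k) for k in cov] + [("safe", x) for x in OBJECTS if x not in EVIDENCE]
    p_cur = joint(cov, safe)
    hits = 0
    for _ in range(n_samples):
        kind, key = rng.choice(free)
        table = cov if kind == "cov" else safe
        table[key] = not table[key]           # symmetric proposal: flip one atom
        p_new = joint(cov, safe)
        if rng.random() < p_new / p_cur:
            p_cur = p_new                     # accept the proposed model
        else:
            table[key] = not table[key]       # reject and restore the old model
        hits += query(cov, safe)
    return hits / n_samples

estimate = mcmc(5000)
```

Because the evidence atoms are never flipped, the chain moves only among models consistent with the data, so the fraction of visited models that satisfy the query estimates the posterior probability. With only 17 free atoms the answer could also be obtained by exhaustive enumeration, which is a useful check on a sampler of this kind.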



Much work remains to be done, including convergence proofs for MCMC on 
infinite model spaces, analyses of the complexity of approximate inference for 
various sublanguages, and incremental query answering between models. The 
approach has much in common with the general pattern theory of Grenander
but should benefit from the vast research base dealing with representation,
inference, and learning in first-order logic.
Abstract. We consider the classification problem of how to predict the
values of a categorical attribute of interest using the other numerical
attributes in a given set of tuples. Decision by voting, such as bagging
and boosting, attempts to enhance existing classification techniques
like decision trees by using a majority decision among them. However,
a high accuracy ratio of prediction sometimes requires complicated predictors,
and makes it hard to understand the simple laws affecting the
values of the attribute of interest. We instead consider another approach
of using at most several fairly simple voters that can compete with
complex prediction tools. We pursue this idea to handle numeric datasets
and employ region splitting rules as relatively simple voters. The results
of empirical tests show that the accuracy of decision by several voters
is comparable to that of decision trees, and the computational cost is
low.



1 Introduction 

1.1 Motivating Example 

Large Decision Tree. Fig. 1(A) shows a sample decision tree generated by the
See5/C5.0 program with the default parameters. The 900-tuple training
dataset is randomly selected from the german-credit (24 numerical) dataset of
1000 tuples, which was obtained from the UCI Machine Learning Repository.
This tree contains 87 leaf nodes, and its prediction error ratio for the remaining
100 tuples of the dataset is 27.0%. However, as shown by the number of the
leaf nodes and the nested structure of the tree, this decision tree is relatively
difficult to interpret and thus not always suitable for representation of scientific
knowledge.

Region Rules for Readability. To address the present classification problem, we 
used a weighted majority decision among region rules in which the component 
rules were relatively powerful and visually understandable. When we consider 
a pair of numeric attributes as a two-dimensional attribute, we can compute 
a line partition of the corresponding two-dimensional space so that it properly 

* Currently with the National Police Agency 
[Figure: (A) decision tree diagram; (B) region rule plots]

(A) An example of a large decision tree generated by the See5/C5.0 program. The
training dataset is 900 tuples of the german-credit dataset. This tree contains 87 leaf
nodes. The prediction error ratio for the remaining 100 tuples of the dataset was
27.0%. Ai denotes the ith attribute.

(B) Region rules for the german-credit dataset. The training dataset (900 tuples)
and test data (100 tuples) were the same as for the above decision tree. The prediction
error ratio of the weighted majority decision among these five rules was 27.0%. Ai denotes
the ith attribute.

Fig. 1. Decision tree and region rules.
classifies the tuples according to a judgment whether or not a tuple is inside the
line partition (region). We call this type of classifier a region rule. For example,
the five region rules presented in Fig. 1(B) were constructed using the same training
dataset as for the decision tree in Fig. 1(A), and the achieved prediction error ratio
(27.0%) was nearly identical to that for the decision tree.



1.2 Decision Tree and Entropy of a Splitting 

Decision Tree. Let us consider the attributes of tuples in a database. An attribute
is called Boolean if its range is {0, 1}, categorical if its range is a discrete
set {1, ..., k} for some natural number k, and numeric if its range is the set of
real numbers. Each data tuple t has m + 1 attributes $A_i$ for i = 0, 1, ..., m. We
treat one Boolean attribute as special, denote it by W, and call it the objective
attribute. The other attributes are called conditional attributes.

The decision tree problem is as follows: A set U of tuples is called positive
(resp. negative) if for a tuple t, the probability that its objective attribute is
1 (resp. 0) is at least $\theta_1$ (resp. $\theta_2$) in U, for given thresholds $\theta_1$ and $\theta_2$. We
would like to classify the set of tuples into positive subsets and negative subsets
by using tests with conditional attributes. Let us consider a rooted binary tree,
each of whose internal nodes is associated with a test on attributes. We
associate each leaf node with the subset of tuples satisfying all tests on the path
from the root to the leaf. Every leaf is labeled as either positive or negative
on the basis of the class distribution in the subset associated with it. Such a
tree-based classifier is called a decision tree.

Entropy of a Splitting. First, we define the entropy of a dataset S with respect
to the objective attribute. Assume that the dataset S contains n tuples. To
formalize our definition of entropy, we consider a more general case in which the
objective attribute W is a categorical attribute taking values in {1, 2, ..., k}. Let
$p_j$ be the relative frequency with which W takes the value j in the dataset S.
The entropy of the dataset S with respect to the objective attribute W is defined
as:

$$\mathrm{Ent}(S) = -\sum_{j=1,\ldots,k} p_j \log p_j. \qquad (1)$$

Using the definition of the entropy of the dataset S, the entropy of a splitting
is defined as follows. Here, let us consider a splitting of the dataset S into two
subsets, $S_1$ and $S_2$, with $n_1$ and $n_2$ data, respectively. The entropy of the splitting
is defined by

$$\mathrm{Ent}(S_1; S_2) = \frac{n_1}{n}\,\mathrm{Ent}(S_1) + \frac{n_2}{n}\,\mathrm{Ent}(S_2). \qquad (2)$$
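Equations (1) and (2) translate directly into a few lines of Python. The sketch below represents a dataset only by the list of its objective-attribute values and assumes base-2 logarithms, since the text does not fix the base; the function names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Ent(S) of Eq. (1): minus the sum of p_j log p_j over observed values j."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def split_entropy(left, right):
    """Ent(S1; S2) of Eq. (2): size-weighted average of the subset entropies.

    At least one of the two subsets must be non-empty.
    """
    n = len(left) + len(right)
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)

mixed = entropy([0, 1, 0, 1])              # evenly mixed Boolean labels
perfect = split_entropy([0, 0], [1, 1])    # both subsets are pure
useless = split_entropy([0, 1], [0, 1])    # both subsets are as mixed as S
```

A splitting that separates the classes perfectly has entropy zero, and one that leaves both subsets as mixed as the original dataset does not reduce the entropy at all.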

To reduce the size (the number of nodes) of a decision tree, Quinlan introduced a
heuristic using a criterion called the gain ratio. The gain ratio is calculated
using the values of entropy defined as above, and expresses the information
generated by a splitting. The heuristic greedily constructs a decision tree in
a top-down, breadth-first manner according to the gain ratio. At each internal
node, the heuristic examines all the candidate tests, and chooses the one for
which the associated splitting of the set of tuples attains the maximum gain
ratio.
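The text does not spell out the formula for the gain ratio. The Python sketch below uses the usual definition from Quinlan's C4.5, namely the information gain of a splitting divided by its split information, with base-2 logarithms assumed. The function names are hypothetical, and returning 0.0 for a degenerate splitting is a convention of this sketch.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels (base 2)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(left, right):
    """Information gain of the splitting divided by its split information."""
    n = len(left) + len(right)
    after = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    gain = entropy(left + right) - after
    # Split information: entropy of the partition sizes themselves.
    split_info = entropy([0] * len(left) + [1] * len(right))
    return gain / split_info if split_info > 0 else 0.0

clean = gain_ratio([0, 0], [1, 1])      # perfect, balanced splitting
pointless = gain_ratio([0, 1], [0, 1])  # splitting that gains nothing
```

Dividing by the split information penalizes tests that carve off many small subsets, which is what distinguishes the gain ratio from the plain information gain.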

1.3 Improvement of Prediction Accuracy by Voting 

Decision by Voting. To improve the power of existing classifiers, we can employ 
a strategy called decision by voting. This approach includes techniques called 
bagging and boosting, which are used in the area of machine learning. As
the term decision by voting suggests, these techniques make the final judgment 
by a majority decision among component tests (voters). Systems that perform 
decision by voting are called ensemble classifiers or combined classifiers and 
attempt to render the total prediction accuracy of voters in the aggregate better 
than that for each of the voters alone. For example, we can use a decision tree 
as a voter. 

Bagging iterates voter generation using a set of tuples which are sampled with 
replacement from the original training dataset. Due to the randomness in the 
sampling of tuples, each voter can have different characteristics in prediction and 
the majority decision works among them. Boosting maintains weights of tuples in 
the training dataset. Each iteration generates a voter using the weighted tuples 
as the training dataset and updates the weights of the tuples to force the next
voter generation to focus on the mis-predicted tuples. Thus, we can prepare a set of
voters with different characteristics and define the weight of each voter according 
to its prediction accuracy in the training dataset. 
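The bagging procedure described above can be sketched in Python as follows. The voter here is a one-dimensional threshold rule (a decision stump), chosen only to keep the example short; it stands in for whatever base classifier is actually being bagged, and all names are hypothetical.

```python
import random
from collections import Counter

def train_stump(sample):
    """Pick the threshold rule on a 1-D feature with the fewest training errors."""
    best = None
    for thr in sorted({x for x, _ in sample}):
        for high in (0, 1):  # label predicted when x > thr
            errors = sum((high if x > thr else 1 - high) != y for x, y in sample)
            if best is None or errors < best[0]:
                best = (errors, thr, high)
    _, thr, high = best
    return lambda x: high if x > thr else 1 - high

def bagging(data, n_voters, seed=0):
    """Train each voter on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    voters = []
    for _ in range(n_voters):
        boot = [rng.choice(data) for _ in data]
        voters.append(train_stump(boot))
    return voters

def vote(voters, x):
    """Unweighted majority decision among the voters."""
    return Counter(v(x) for v in voters).most_common(1)[0][0]

# Hypothetical data: the label is 1 exactly when the feature is at least 5.
data = [(i, int(i >= 5)) for i in range(10)]
voters = bagging(data, n_voters=11)
```

Boosting differs in that each round reweights the training tuples instead of resampling them uniformly, and the final vote is weighted by each voter's training accuracy.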

Previous experimental results have shown that bagging and boosting work 
well even with relatively small numbers of voters, and that the contribution of 
an additional voter to reduction of the prediction error ratio decreases as the
number of voters increases.

1.4 Rule Readability towards Scientific Discovery 

Simplicity of Rules. Accuracy is obviously a requisite element for prediction of 
attribute values. However, a high accuracy ratio of prediction sometimes requires 
a large-sized predictor. For example, the size of the whole decision tree can grow 
up to more than a few hundred nodes. If we use a majority decision among 
a set of large decision trees, this is a great disadvantage with respect to rule
readability. In some cases, especially from the viewpoint of knowledge discovery, 
we want to understand the simple laws affecting the predictor. For instance, in 
the bio-medical area, a predictor may make a decision as to whether or not a 
person is genetically affected by a particular level of gene expression in a cell. In 
this situation it is also important to understand the rules used in the predictor. 
However, a large-sized predictor may be accurate and still not always readable. 

Reduction of the size of a predictor is indispensable to meet the requirement
for readability. Focusing on the size of a decision tree, for example, Morimoto et
al. [13] reduced the size of a decision tree by pruning ineffective internal nodes and
by using a relatively powerful test on each internal node that utilized a region
rule instead of simple inequalities. As previously described, a region rule classifies
tuples according to a judgment whether or not a tuple has particular values for
two attributes. On each plane spanned by an arbitrary pair of the conditional
attributes, the optimal region is computed. The two axes for a region rule on
an internal node are selected so that the entropy of a splitting on the node is
minimized. A region is called a rectilinear convex region if both its intersection
with any vertical line and its intersection with any horizontal line are always
undivided. Among the types of two-dimensional regions, the use of rectilinear
convex regions on internal nodes has been reported to show good performance
for several datasets of the UCI Machine Learning Repository.

1.5 Weighted Majority among Relatively Powerful Voters 

Based on the above considerations, a natural direction of constructing a simple 
predictor is to carry out a weighted majority decision among a small number 
(e.g., fewer than ten) of relatively powerful but readable rules, such as region rules.

In the remainder of this article, we introduce a simple method of obtaining a 
weighted majority decision among a set of region rules and empirically evaluate 
the accuracy and the size of predictors using the example numerical datasets 
with a Boolean or categorical objective attribute from the UCI Machine Learning 
Repository. 



2 Method — Decision by Majority 

Construction and Choice of Voters. First, we consider the case where the objective
attribute is Boolean. Let W be the objective attribute with Boolean
values and $A = \{A_1, \ldots, A_m\}$ be the set of numeric conditional attributes. We
selected the voters from candidate region rules according to the entropy of the
splitting in the training dataset. We have ${}_m C_2$ planes spanned by the pairs of
conditional attributes, and calculate the optimum rectilinear convex region (i.e.,
one that minimizes the entropy) on each plane. This calculation can be carried
out in practice using the polynomial-time algorithm developed by Yoda et al.
These region rules are used as the candidates of voters. The obtained ${}_m C_2$
regions (voter candidates) are sorted in increasing order of the entropy of the
splitting and the k-best region rules are used as the voters (k is a given constant
number), with $R_i$ denoting the region used in the i-th best region rule.
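The selection loop over attribute pairs can be sketched in Python as below. The sketch deliberately swaps in a much simpler region family: instead of the optimal rectilinear convex region computed by the algorithm of Yoda et al., it searches axis-aligned rectangles whose sides lie on a given list of cut points. The rectangle search, the `cuts` parameter, and all function names are assumptions of the sketch.

```python
from collections import Counter
from itertools import combinations
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def split_entropy(inside, outside):
    n = len(inside) + len(outside)
    return len(inside) / n * entropy(inside) + len(outside) / n * entropy(outside)

def best_rectangle(points, labels, cuts):
    """Stand-in for the optimal-region search: best axis-aligned rectangle on one plane."""
    best = None
    for x0, x1 in combinations(cuts, 2):
        for y0, y1 in combinations(cuts, 2):
            mask = [x0 <= x < x1 and y0 <= y < y1 for x, y in points]
            inside = [l for l, m in zip(labels, mask) if m]
            outside = [l for l, m in zip(labels, mask) if not m]
            if inside and outside:
                score = split_entropy(inside, outside)
                if best is None or score < best[0]:
                    best = (score, (x0, x1, y0, y1))
    return best

def k_best_voters(rows, labels, k, cuts):
    """One candidate per attribute pair, sorted by splitting entropy; keep the k best."""
    m = len(rows[0])
    candidates = []
    for a, b in combinations(range(m), 2):  # the mC2 planes
        found = best_rectangle([(r[a], r[b]) for r in rows], labels, cuts)
        if found:
            candidates.append((found[0], (a, b), found[1]))
    candidates.sort(key=lambda c: c[0])
    return candidates[:k]

# Hypothetical data: the label depends only on attributes 0 and 1.
rows = [(x, y, (3 * (x + y)) % 4) for x in range(4) for y in range(4)]
labels = [int(1 <= x < 3 and 1 <= y < 3) for x, y, _ in rows]
voters = k_best_voters(rows, labels, k=2, cuts=[0, 1, 2, 3, 4])
```

Each returned entry holds the splitting entropy, the attribute pair that spans the plane, and the rectangle bounds, so the best voter in this toy dataset is the rectangle on the plane of attributes 0 and 1.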

Weight of Voters and Prediction. Selected region rules (voters) vote on whether
or not the objective attribute of tuple t is 1. The weight of each voter is expressed
as a value between 0 and 1. When the summation of these weights exceeds half
of the number of the voters, they predict the objective attribute of tuple t is 1.
We calculated the weight of a voter using the prediction accuracy in the training
dataset. For the set of tuples which are inside (resp. outside) the region R, the
ratio of the tuples whose objective attribute is 1 is calculated, and $p(R)$ (resp.
$p(\bar{R})$) denotes this ratio. When we can assume that the distribution of attribute
values is the same between the training dataset and the test dataset, we can
evaluate the weight of a voter using those ratios. For a given tuple t, during the
process of majority decision, the weight of the i-th voter using $R_i$ is defined as
follows:

$$w(t, R_i) = \begin{cases} p(R_i) & \text{if } t \in R_i \\ p(\bar{R_i}) & \text{if } t \notin R_i \end{cases} \qquad (3)$$

If the tuple t is inside the region $R_i$, the upper definition is used to calculate the
weight of a voter. In the same way, if the tuple t is outside the region $R_i$ we can
calculate the weight of a voter using the lower definition. The latter is based on
the fact that the objective attribute of a tuple outside the region is predicted to
be 0 with probability $1 - p(\bar{R_i})$, which equivalently means that its objective attribute
is 1 with probability $p(\bar{R_i})$.

The summation of the weights of the k voters, $g(t) = \sum_{i=1}^{k} w(t, R_i)$, indicates
the degree of possibility that the objective attribute of tuple t is 1. If the value
of g(t) is greater than a threshold k/2, the majority decision predicts that the
objective attribute of tuple t is 1, otherwise 0.
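The weight of Eq. (3) and the decision rule on g(t) can be sketched in Python as follows. A region is represented by a membership predicate, which is an assumption of the sketch; the fallback weight of 0.5 for a region with no training tuples on one side is also an assumption, as the text does not cover that case.

```python
def fit_voter(inside, rows, labels):
    """Return (inside, p_in, p_out): the ratio of label 1 inside and outside the region."""
    in_labels = [l for r, l in zip(rows, labels) if inside(r)]
    out_labels = [l for r, l in zip(rows, labels) if not inside(r)]
    # 0.5 is a neutral fallback (assumption) when one side holds no training tuples.
    p_in = sum(in_labels) / len(in_labels) if in_labels else 0.5
    p_out = sum(out_labels) / len(out_labels) if out_labels else 0.5
    return inside, p_in, p_out

def g(t, voters):
    """Sum of the voter weights of Eq. (3) for tuple t."""
    return sum(p_in if inside(t) else p_out for inside, p_in, p_out in voters)

def predict(t, voters):
    """Predict 1 when g(t) exceeds half the number of voters, otherwise 0."""
    return 1 if g(t, voters) > len(voters) / 2 else 0

# Hypothetical one-attribute training data and two hand-picked regions.
rows = [(0,), (1,), (2,), (3,)]
labels = [0, 0, 1, 1]
voters = [
    fit_voter(lambda r: r[0] >= 2, rows, labels),
    fit_voter(lambda r: r[0] >= 1, rows, labels),
]
```

Because each weight is a ratio estimated from the training data, a voter whose region separates the classes poorly contributes a value close to the overall share of positive tuples on both sides and therefore moves g(t) little between tuples.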



Handling Categorical Objective Attribute. In the previous sections, we supposed
the objective attribute is Boolean. In this section we expand our approach to
handle cases in which the objective attribute is categorical.

We reduced a classification problem with the categorical objective attribute
to one with Boolean objective attribute using twoing. Let C = {1, ..., J} be
the set of values of the objective attribute of the original problem. For $1 \le j \le J$,
we generate a series of sub-predictors $R^j$, each of which predicts only whether
or not the objective attribute of a tuple is j. As the sub-predictor $R^j$ we used
a weighted majority decision among k region rules introduced in the case with
the Boolean objective attribute.

Let $R_i^j$ be the ith-best region rule which predicts whether the objective attribute of
a tuple is j or not, and let $R^j = \{R_1^j, \ldots, R_k^j\}$ denote the jth sub-predictor (k is
the number of voters). Using the redefined weight of voters w' as follows, $R^j$
predicts whether the objective attribute of a tuple is j or not:

$$w'(t, R_i^j) = \begin{cases} p(R_i^j) & \text{if } t \in R_i^j \\ p(\bar{R_i^j}) & \text{if } t \notin R_i^j \end{cases} \qquad (4)$$

For $1 \le j \le J$, we carry out weighted majority decision among region rules
and sum up the weights of voters: $g_j(t) = \sum_{i=1}^{k} w'(t, R_i^j)$. Here, $g_j(t)$ gives the
prediction accuracy of the jth sub-predictor. When the sth sub-predictor is most
accurate, the value of the objective attribute can be predicted as s. Note that if
a sub-predictor uses k voters out of ${}_m C_2$ candidate voters, the total number of
the region rules used for prediction is $k \times J$.
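The categorical case can be sketched in Python as one group of voters per class, with the predicted class being the one whose summed weight is largest. The sketch reads "most accurate" as "largest value of the summed weight", which is an interpretation; the representation of a voter as a triple of membership predicate and two ratios, and all names, are assumptions.

```python
def class_score(t, voters):
    """Summed weight of Eq. (4) for one sub-predictor."""
    return sum(p_in if inside(t) else p_out for inside, p_in, p_out in voters)

def predict_class(t, sub_predictors):
    """sub_predictors maps each class j to its list of (inside, p_in, p_out) voters."""
    scores = {j: class_score(t, voters) for j, voters in sub_predictors.items()}
    return max(scores, key=scores.get), scores

# Hypothetical three-class problem on one attribute, with one voter per class.
sub_predictors = {
    1: [(lambda r: r[0] < 1, 0.9, 0.1)],
    2: [(lambda r: 1 <= r[0] < 2, 0.8, 0.2)],
    3: [(lambda r: r[0] >= 2, 0.7, 0.15)],
}
label, scores = predict_class((1.5,), sub_predictors)
```

With k voters in each of the J sub-predictors, a prediction touches k times J region rules in total, but explaining why a tuple was assigned to class s requires looking only at the k rules of the sth sub-predictor.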

Fig. 2 shows an example of the region rules for a categorical objective attribute.
This example shows the region rules ($R^1$, $R^2$, and $R^3$, k = 7) for the
waveform+noise dataset obtained from the UCI Machine Learning Repository.
When the predictor says that the objective attribute of a tuple is 1, for instance,
one must see only the seven region rules $R^1 = \{R_1^1, \ldots, R_7^1\}$ for the related
conditional attributes.
Fig. 2. Rectilinear convex region rules for a dataset with a categorical objective
attribute. The training dataset is 4500 tuples of the waveform+noise dataset.



3 Experimental Results 

Test Datasets. We used eight public datasets with numeric conditional attributes
(Table 1) that we acquired from the UCI Machine Learning Repository.
These datasets are indicated in Table 1 along with the corresponding numbers
of classes, tuples, and attributes.

Implementations. We implemented the algorithm using the C++ language with
the POSIX thread library on a Sun Microsystems Enterprise 10000¹. We generated
the ${}_m C_2$ candidate voters in a parallel manner (m is the number
of the conditional attributes). In a sequential execution, only a single thread
iterates the generation of the candidates ${}_m C_2$ times. We split this loop into N
sub-loops and executed them using N threads on the processors. The number
of threads N is decided according to the number of the attributes so that the
calculation is evenly distributed among threads. The tuple data is stored in an
array in the main memory before the creation of threads, and the threads access
it during the generation of candidates. This memory access can be carried
out without mutual exclusion, since the table is never updated. We sort the obtained
voter candidates in increasing order of their entropy and use the k-best
candidates as the voters.

For example, when the german-credit dataset (the number of the conditional
attributes is 24) is used as the input data, the total number of iterations is ${}_{24}C_2$ =
276 = 46 × 6. Because the number of the available processors is 64, we split the
iteration into 46 sub-loops, each of which generates six voter candidates. Using

¹ Commercially available shared-memory parallel computer. In this experiment we
used an E10000 with 64 UltraSPARC-II (250 MHz) processors.
Table 1. Summary of the eight datasets used in our experiments.

Dataset                      #Class   #Tuple   #Attrib.
breast-cancer-wisconsin*     2        699      9
german-credit (numerical)    2        1000     24
liver-disorders              2        345      6
pima-indians-diabetes        2        768      8
balance-scale                3        625      4
waveform                     3        5000     20
waveform+noise               3        5000     40
vehicle                      4        846      18

*We assigned the average value of the attribute to
the 16 missing values contained.



Table 2. Summary of ten-fold cross-validation results.

                   Region Voting             Region Single Tree   See5/C5.0 Single Tree   See5/C5.0 Boosting
Dataset            Err(%)   #R     N x N     Err(%)   #Leaf       Err(%)   #Leaf          Err(%)   #Leaf
breast-cancer.     3.0      8      10x10     4.2      3.3         5.0      13.7           3.7      165.7
german-credit      27.1     5      20x20     23.8     3.6         27.7     77.5           25.1     786.6
liver-disorders    34.2     4      20x20     38.8     3.2         35.6     25.7           30.4     217.8
pima-indians.      23.6     10     20x20     25.1     2.1         26.1     28.8           25.0     457.2
balance-scale      13.1     6x3    6x6       15.5     34.7        22.6     43.7           18.3     610.1
waveform           18.4     7x3    25x25     21.0     33.2        23.3     301.1          17.7     2325.3
waveform+n.        18.0     11x3   30x30     21.8     34.6        24.3     305.4          17.1     2253.7
vehicle            29.6     7x4    30x30     28.5     12.2        28.0     70.9           24.1     681.3

46 threads, it takes 15 seconds to calculate the optimum regions on the planes
spanned by all the pairs of conditional attributes. Compared to the elapsed time
with only a single thread (344 seconds), a more than 20-fold acceleration was
achieved.
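The way the 276 attribute pairs are divided into 46 sub-loops of six can be sketched in Python with a thread pool. This is only an illustration of the partitioning, not of the authors' C++ and POSIX threads implementation. The `evaluate` callback is a hypothetical stand-in for the optimal-region computation on one plane, and in CPython pure-Python work of this kind would not speed up with threads because of the global interpreter lock.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

def chunk(items, n_chunks):
    """Split items into n_chunks contiguous sub-loops whose sizes differ by at most one."""
    size, rest = divmod(len(items), n_chunks)
    out, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < rest else 0)
        out.append(items[start:end])
        start = end
    return out

def generate_candidates(pairs, evaluate, n_threads):
    """Evaluate every attribute pair, one sub-loop per thread, then sort by score."""
    def work(sub_loop):
        # Read-only access to shared data, so no locking is needed.
        return [(evaluate(a, b), (a, b)) for a, b in sub_loop]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        parts = list(pool.map(work, chunk(pairs, n_threads)))
    return sorted(c for part in parts for c in part)

pairs = list(combinations(range(24), 2))   # the 276 planes of german-credit
sub_loops = chunk(pairs, 46)
# abs(a - b) is a placeholder score, not an entropy.
candidates = generate_candidates(pairs, lambda a, b: abs(a - b), n_threads=46)
```

The reported times of 344 seconds on one thread and 15 seconds on 46 threads correspond to a speedup of roughly 23, about half of the ideal 46.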

Classification Capability. Using the datasets shown in Table 1, a ten-fold cross-validation
was carried out for each combination of the grid resolution and the
number of the voters in order to evaluate the efficiency of the proposed method.
The grid resolutions used were N × N (N = 3, 5, 10, 15, 20, 25, 30, and 35).
For the breast-cancer-wisconsin and balance-scale datasets, we used 10 and 6
as the values of N, respectively, since the attribute values of these datasets are
numerical but discrete.
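The ten-fold procedure itself can be sketched in Python as follows. The sketch assumes that the tuples have already been shuffled, uses contiguous folds, and takes a hypothetical `train_and_test` callback that trains on one index list and returns the error ratio on the other; none of these details are given in the text.

```python
def k_fold_indices(n, k=10):
    """Split range(n) into k contiguous folds whose sizes differ by at most one."""
    size, rest = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        end = start + size + (1 if i < rest else 0)
        folds.append(list(range(start, end)))
        start = end
    return folds

def cross_validate(n, k, train_and_test):
    """Average error over k folds; train_and_test(train_idx, test_idx) returns an error ratio."""
    folds = k_fold_indices(n, k)
    errors = []
    for i, test in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        errors.append(train_and_test(train, test))
    return sum(errors) / k

# liver-disorders has 345 tuples, so the test folds hold 34 or 35 tuples each.
fold_sizes = [len(f) for f in k_fold_indices(345)]
# Placeholder evaluator: exactly one misclassified tuple per test fold.
one_error_each = cross_validate(345, 10, lambda train, test: 1 / len(test))
```

With test folds of this size a single misclassified tuple changes the error ratio of that fold by almost three percentage points, so curves for the smaller datasets fluctuate visibly.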

Table 2 presents the results of the ten-fold cross-validation tests. In this
table, the “Region Voting” column shows the results of the majority decision
among rectilinear convex region rules. #R is the number of the region rules
used for weighted majority decision. N × N is the optimum grid resolution. For
the datasets having the categorical objective attribute (lower four cases), the
numbers of the used region rules are k × #Class, where k is the number of the
[Figure: four panels, (a) breast-cancer-wisconsin, (b) german-credit (24 numerical), (c) liver-disorders, (d) pima-indians-diabetes]

Fig. 3. The relation between #voters and the error ratio of decision by majority
(Boolean). Note that in the liver-disorders dataset, misclassification of a single
tuple in its 35-tuple test data raises the error ratio by 2.9%. Thus, a dataset
with a relatively small number of tuples can easily cause oscillation in the graph.



voters used to predict whether or not a tuple belongs to a particular class. For
example, the dataset in Fig. 2 uses 7×3 region rules. The “Region Single Tree”
column shows the error ratio and the tree size (number of the leaf nodes) of
the decision trees using rectilinear convex region rules on each internal node
(the original data appears in Table 3 of Morimoto et al. [13]). Note that the program used to
generate rectilinear convex regions in this article is different from that used in
Morimoto et al.; in the latter study, optimized grid regions were generated
with respect to the density of the tuples in a grid cell. The “See5/C5.0 Single
Tree” and “See5/C5.0 Boosting” columns show the results of the decision tree
using guillotine cutting on each internal node (generated by the See5 program
with default parameters). The former column shows the results by a single
tree, and the latter shows the boosted results by ten component classifiers. The
size of a boosted classifier is estimated by the summation of the size of the ten
component classifiers.

For the datasets with a Boolean objective attribute (the upper four datasets),
Fig. 3 shows the relation between the number of the voters and the error ratio at
the grid resolution shown in Table 2. For the datasets with a categorical objective
attribute (the lower four datasets), if we regard #R/#Class as the number of
the voters, Fig. 4 shows the relation between the number of the voters and the
error ratio at the grid resolution shown in Table 2.



[Figure: four panels, (a) balance-scale, (b) waveform, (c) waveform+noise, (d) vehicle]

Fig. 4. The relation between #voters and the error ratio of decision by majority
(categorical).



Comparisons. If the trees can make a high-accuracy prediction using fewer nodes 
than voters, then of course the majority decision among region rules might be 
useless. However, in a non-trivial dataset it is difficult to achieve high accuracy 
using only a few guillotine cutting rules. 

• Region Voting vs. See5/C5.0 Boosting. “Region Voting” won over “See5/C5.0
Boosting” in the error ratio on three datasets and competed with the latter on
two datasets (waveform and waveform+noise). However, note that the average
size of boosted trees ranged from 165.7 to 2325.3 for the eight datasets. They
were obviously unreadable.

• Region Voting vs. Region Single Tree. “Region Voting” won over “Region Sin- 
gle Tree” in the error ratio on all the datasets except the german-credit dataset. 
However, the numbers of leaf nodes of the latter trees were particularly remark- 
able in the four datasets with a Boolean objective attribute; the average number 
of the leaf nodes for those datasets ranged from 2.1 to 3.6. Apart from the 
relatively high error ratio, this might be advantageous for readability. 
• Region Voting vs. See5/C5.0 Single Tree. We can conclude that “Region Voting”
won over “See5/C5.0 Single Tree” on all the datasets both in the error ratio
and readability (see also the comparison in Fig. 1).

Computing Cost. We will next consider the computing cost of generating a predictor.
According to Morimoto et al. [13], decision trees using rectilinear convex
regions are more than 10 times as costly as those using guillotine cutting for
a sample dataset (3000 tuples from the waveform dataset with the first 12 attributes).
The drawback to constructing a tree with rectilinear convex region
rules is that optimal regions must be calculated for all the attribute pairs recursively.
In the method proposed here, on the other hand, the generation of
voter candidates corresponds to that of the root node used in the decision tree
with rectilinear convex region rules. Therefore, the weighted majority decision
method is less costly than the decision tree method with rectilinear convex region
rules. However, it is more costly than a decision tree with guillotine cutting
rules. See5 can generate a decision tree from 900 tuples of the german-credit
dataset in 0.9 seconds. On the other hand, our implementation of the weighted
majority decision on an E10000 requires 15 seconds to generate voters using 46
threads.



4 Related Work 



Bagging [?] and boosting are methods for improving the total prediction
accuracy by voting among a set of existent component predictors [?].
Most of the studies have focused on improving the accuracy and stability of
the prediction procedure and have used relatively many component predictors
(i.e., more than ten). In terms of enhancing the understandability of predictors,
transformation of the decision tree into a set of if-then rules [?] is one solution.
The worth of such a set is represented by its encoding length theoretically based
on the Minimum Description Length (MDL) Principle, and a good rule set is
calculated by deleting redundant or less accurate rules. As previously described,
relatively powerful rules on decision nodes can reduce the size of the tree. Two
examples are a decision tree using a region rule on a node [?], and a decision
tree adopting a weighted majority decision on the nodes [?]. Although the latter
approach focuses on improving the prediction accuracy, a decision tree consisting
of only a single node with a restricted number of voters can also be used for this
purpose.



5 Concluding Remarks 

We have proposed a decision by majority method using relatively powerful
predictors consisting of rectilinear convex region rules. Experiments using diverse
datasets confirmed that, even with fewer than ten voters, the achieved prediction
accuracy is better than or comparable to that of decision trees and boosted
decision trees. Our decision by majority method uses a list of rectilinear convex
region rules. This list is much smaller and less structured than conventional
decision trees. This property contributes to the readability of the predictors and
thus is especially advantageous from the viewpoint of enhancing the
comprehensibility of the obtained knowledge.

^ On a PC/AT compatible with an Intel Pentium II processor (300 MHz) and Windows
NT Workstation 4.0.

Several topics remain to be pursued in future studies. For example, in the
present method we select the voters according to the entropy and adopt the
prediction accuracy for the training dataset as the weight of the voting. Although
this works well experimentally, a more effective system might be developed. In
addition, experiments in this article were restricted to datasets with numerical
conditional attributes and a Boolean or categorical objective attribute. However,
it will also be important to accommodate datasets which include Boolean
and categorical conditional attributes [?] and a numerical objective attribute.
These goals can be realized by a natural expansion of the proposed weighted
majority decision method, and are currently being investigated.
Abstract. Emerging patterns (EPs) are itemsets whose supports change
significantly from one dataset to another; they were recently proposed 
to capture multi-attribute contrasts between data classes, or trends over 
time. In this paper we propose a new classifier, CAEP, using the follow- 
ing main ideas based on EPs: (i) Each EP can sharply differentiate the 
class membership of a (possibly small) fraction of instances containing 
the EP, due to the big difference between its supports in the opposing 
classes; we define the differentiating power of the EP in terms of the 
supports and their ratio, on instances containing the EP. (ii) For each 
instance t, by aggregating the differentiating power of a fixed, automat- 
ically selected set of EPs, a score is obtained for each class. The scores 
for all classes are normalized and the largest score determines t’s class. 
CAEP is suitable for many applications, even those with large volumes 
of high (e.g. 45) dimensional data; it does not depend on dimension re- 
duction on data; and it is usually equally accurate on all classes even 
if their populations are unbalanced. Experiments show that CAEP has
consistently good predictive accuracy, and it almost always outperforms
C4.5 and CBA. By using efficient, border-based algorithms (developed
elsewhere) to discover EPs, CAEP scales up on data volume and
dimensionality. Observing that accuracy on the whole dataset is too coarse a
description of classifiers, we also used a more accurate measure, sensitivity
and precision, to better characterize the performance of classifiers.
CAEP is also very good under this measure.



1 Introduction 

Classification is an important problem in data mining and machine learning, 
aimed at building a classifier from training instances for predicting the classes 
of new instances. Recently, datasets have been growing in both
volume and dimensionality (number of attributes); a new challenge is the ability
to efficiently build highly accurate classifiers from such datasets. In this paper
we propose a new classifier, CAEP (Classification by Aggregating Emerging
Patterns), which is suitable for many applications, even those with large volumes
of high dimensional data. The classifier is highly accurate, and is usually equally
accurate on all classes even if their populations are unbalanced. These advantages
are achieved without dimension reduction on data.

CAEP is based on the following two main new ideas: 

(i) We use a new type of knowledge, the emerging patterns (EPs), recently
proposed in [?], to build CAEP. Roughly speaking, EPs are those itemsets whose
supports (i.e. frequencies) increase significantly from one class of data to another.
For example, the itemset {odor=none, stalk-surface-below-ring=smooth,
ring-number=one} in the Mushroom dataset [?] is a typical EP, whose support
increases from 0.2% in the poisonous class to 57.6% in the edible class, at a
growth rate of 288 (= 57.6%/0.2%). For us, an item is a simple test on an attribute,
and an EP is a multi-attribute test. Each EP can have very strong power for
differentiating the class membership of some instances: if a new instance s contains
the above EP, then with odds of 99.6% we can claim that s belongs to the edible
class. In general, the differentiating power of an EP is roughly proportional to
the growth rate of its supports and its support in the target class.

(ii) An individual EP is usually sharp in telling the class of only a very small 
fraction (e.g. 3%) of all instances, and thus it will have very poor overall clas- 
sification accuracy if it is used by itself on all instances. To build an accurate 
classifier, we first find, for each class C, all the EPs meeting some support and 
growth rate thresholds, from the (opponent) set of all non-C instances to the
set of all C instances. Then we aggregate the power of the discovered EPs for 
classifying an instance s: We derive an aggregate differentiating score for each 
class C, by summing the differentiating power of all EPs of C that occur in s; the 
score for C is then normalized by dividing it by some base score (e.g. median) 
of the training instances of C. Finally, we let the largest normalized score deter- 
mine the winning class. Normalization is done to reduce the effect of unbalanced 
distribution of EPs among the classes (classes with more EPs frequently give 
higher scores to instances, even to those from other classes). 

CAEP achieves very good predictive accuracy on all the datasets we tested, 
and it gives better accuracy than C4.5 and CBA on all except one of these 
datasets. (Note: we reported all datasets that we tested!) We believe that the 
high accuracy is achieved because we are using a new high dimensional method 
to solve a high dimensional problem: Each EP is a multi-attribute test and 
CAEP is using the combined power of an unbounded set of EPs to arrive at a 
classification decision. 

Being equally accurate on all classes is very useful for many applications, 
where there is a dominant class (e.g. 98% of all instances) and a minority class,
and the sole purpose of classification is to accurately catch instances of the 
minority class. Classification accuracy is not the desired measure, as we would 
then consider the classifier which classifies all instances as in the dominant class a 
very good classifier. In this paper we also measure classifiers using sensitivity and 
precision, which reward classifiers that correctly label more minority instances 
and do not mislabel many other instances. 
The CAEP classifier can be efficiently built for large high dimensional training
datasets in a scalable way, since EPs can be discovered efficiently, using
border-based algorithms [?] and Max-Miner [?]. We can quickly produce CAEP
classifiers for datasets such as Mushroom, whose records consist of 21 attributes.
(Figure 9 of [?] shows that CPU time is 100 seconds for a support threshold of
0.1%. Border-based algorithms [?] can then find the EPs in around 0.5 hour.)

Several parameters need to be selected. All these are done automatically, 
using the performance of the resulting classifier on the training instances as 
guidance. Because of the use of aggregation and perhaps normalization, we have 
not encountered the traditional overfitting problem in our experiments. 

Organization: We compare CAEP with related work below. §2 introduces 
EPs and preliminaries. §3 presents our main ideas on how to build CAEP. §4 
discusses how to efficiently discover the EPs, their supports and growth rates. 
§5 discusses how to reduce the number of EPs, and §6 the automatic selection
of CAEP’s parameters. §7 discusses the rice-DNA dataset and the need to
use sensitivity and precision to measure classifiers. §8 contains the experimental 
results, and §9 offers some concluding remarks. 

Related Work: CAEP is fundamentally different from previous classifiers
in its use of the new knowledge type of EPs. To arrive at a score for decision
making, CAEP uses a set of multi-attribute tests (EPs) for each class. Most
previous classifiers consider only one test on one attribute at a time; a few
exceptions, X-of-N [?], CBA [?] and linear decision trees [?], consider only one
multi-attribute test to make a decision.

Aggregation of the differentiating power of EPs is different from bagging or
boosting [?], which manipulate the training data to generate different classifiers
and then aggregate the votes of several classifiers. With CAEP, each EP is too
weak as a classifier and all the EPs are more easily obtained.

Loosely speaking, our aggregation of the power of EPs in classification is
related to the Bayesian prediction theory. For an instance t viewed as an itemset,
Bayesian prediction would label t as $C_k$, where the probability $Pr(t|C_k) * Pr(C_k)$
is the largest among the classes. The optimal Bayesian classifier needs to “know”
the probability $Pr(t|C_k)$ for all possible t, which is clearly impractical for high
dimensional datasets. Roughly speaking, CAEP “approximates” $Pr(t|C_k) * Pr(C_k)$
using the normalized score.

CAEP is the first application of EPs to classification. Partially influenced by
CAEP, [8] proposes a different classifier, JEP-Classifier, also based on aggregated
power of EPs. Major differences include: (i) CAEP uses general EPs, whereas
JEP-Classifier uses exclusively jumping EPs (i.e. EPs whose support increases
from zero in one dataset to non-zero in the other dataset). (ii) For datasets with
more than two classes, CAEP uses the classes in a symmetric way, whereas
JEP-Classifier uses them in an ordered way. (iii) In aggregating the differentiating
power of all EPs, CAEP uses factors based on both support and support growth
rate, whereas JEP-Classifier uses only the supports. (iv) As CAEP uses EPs
with mixed growth rates, the reduction of the EPs is more complicated; for
JEP-Classifier, all jumping EPs have infinite growth rates and the reduction is
simpler. (v) The normalization idea is used in CAEP but not in JEP-Classifier.
The two classifiers offer their own advantages: CAEP is better for cases with few
or even no jumping EPs whose supports meet a reasonable threshold (such as
1%), whereas JEP-Classifier is better when there are many jumping EPs. Each
of them is almost consistently better than C4.5 and CBA.

2 Emerging Patterns and Preliminaries 

Assume the original data instances have m attribute values. Each instance in the
training dataset $\mathcal{D}$ is associated with a class label, out of a total of p class labels:
$C_1, C_2, \ldots, C_p$. We partition $\mathcal{D}$ into p sets, $\mathcal{D}_1, \mathcal{D}_2, \ldots, \mathcal{D}_p$, with $\mathcal{D}_i$
containing all instances of class $C_i$.

Emerging patterns are defined for binary transaction databases. To find them,
we may need to encode a raw dataset into a binary one: We discretize the value
range of each continuous attribute into intervals. Each (attribute, interval)
pair is called an item in the binary (transaction) database, which will be
represented as an integer for convenience. An instance t in the raw dataset will then
be mapped to a transaction of the binary database: t has the value 1 on exactly
those items (A, v) where t’s A-value is in the interval v. We will represent this
new t as the set of items for which it takes 1, and we will assume henceforth the
datasets $\mathcal{D}, \mathcal{D}_1, \ldots, \mathcal{D}_p$ are binary.
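The encoding step described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the cut points, the attribute names, and the helpers `build_item_ids` and `encode` are hypothetical and not taken from the paper, which does not fix a particular discretization method here.

```python
from bisect import bisect_right

# Hypothetical cut points per continuous attribute (assumed for illustration).
# "temp" is split into (-inf, 15), [15, 25), [25, +inf); "humidity" at 70.
CUTS = {"temp": [15.0, 25.0], "humidity": [70.0]}


def build_item_ids(cuts):
    """Assign one integer item id to every (attribute, interval index) pair."""
    ids = {}
    for attr in sorted(cuts):
        for interval in range(len(cuts[attr]) + 1):
            ids[(attr, interval)] = len(ids) + 1
    return ids


def encode(instance, cuts, ids):
    """Map a raw instance (dict: attribute -> value) to its set of items."""
    return frozenset(
        ids[(attr, bisect_right(cuts[attr], value))]
        for attr, value in instance.items()
    )


ITEM_IDS = build_item_ids(CUTS)
t = encode({"temp": 18.2, "humidity": 85.0}, CUTS, ITEM_IDS)
print(sorted(t))
```

Each encoded instance contains exactly one item per attribute, which is what makes the set-of-integers representation equivalent to the binary transaction described in the text.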

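Let I be the set of all items in the encoding. An itemset X is a subset
of I, and its support in a dataset $\mathcal{D}'$, $supp_{\mathcal{D}'}(X)$, is
$\frac{|\{t \in \mathcal{D}' \mid X \subseteq t\}|}{|\mathcal{D}'|}$. Given two
datasets $\mathcal{D}'$ and $\mathcal{D}''$, the growth rate of an itemset X from $\mathcal{D}'$ to $\mathcal{D}''$ is defined
as $growth\_rate_{\mathcal{D}' \to \mathcal{D}''}(X) = \frac{supp_{\mathcal{D}''}(X)}{supp_{\mathcal{D}'}(X)}$ if $supp_{\mathcal{D}'}(X) \neq 0$;
$= 0$ if $supp_{\mathcal{D}'}(X) = supp_{\mathcal{D}''}(X) = 0$; and $= \infty$ if $supp_{\mathcal{D}'}(X) = 0 \neq supp_{\mathcal{D}''}(X)$.

Emerging patterns [?] are itemsets with large growth rates from $\mathcal{D}'$ to $\mathcal{D}''$.

Definition 1. Given growth rate threshold $\rho > 1$, an emerging pattern ($\rho$-EP
or simply EP) from $\mathcal{D}'$ to $\mathcal{D}''$ is an itemset e where $growth\_rate_{\mathcal{D}' \to \mathcal{D}''}(e) \geq \rho$.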

Example 1. Consider the following training dataset with two classes, $\mathcal{P}$ and $\mathcal{N}$
(this is actually an encoding of the Saturday morning activity example from [?]).

    $\mathcal{P}$                                      $\mathcal{N}$
    {2,6,7,10}  {3,5,7,10}  {3,4,8,10}     {1,6,7,10}  {1,6,7,9}
    {2,4,8,9}   {1,4,8,10}  {3,5,8,10}     {3,4,8,9}   {1,5,7,10}
    {1,5,8,9}   {2,5,7,9}   {2,6,8,10}     {3,5,7,9}

Then {1, 9} is an EP from class $\mathcal{P}$ to class $\mathcal{N}$ with a growth rate of 9/5; it is also a
$\rho$-EP for any $1 < \rho \leq 9/5$. Some other EPs are given later.
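The definitions of support and growth rate can be checked directly against Example 1. The following is a minimal sketch; the function names `supp` and `growth_rate` are chosen here for illustration, and exact fractions are used so that the growth rate of {1, 9} is not blurred by rounding.

```python
from fractions import Fraction
from math import inf

# Example 1 data: each instance is a set of integer items.
P = [{2, 6, 7, 10}, {3, 5, 7, 10}, {3, 4, 8, 10},
     {2, 4, 8, 9}, {1, 4, 8, 10}, {3, 5, 8, 10},
     {1, 5, 8, 9}, {2, 5, 7, 9}, {2, 6, 8, 10}]
N = [{1, 6, 7, 10}, {1, 6, 7, 9}, {3, 4, 8, 9}, {1, 5, 7, 10}, {3, 5, 7, 9}]


def supp(itemset, dataset):
    """Fraction of instances in `dataset` that contain `itemset`."""
    hits = sum(1 for t in dataset if itemset <= t)
    return Fraction(hits, len(dataset))


def growth_rate(itemset, source, target):
    """Growth rate of `itemset` from `source` to `target`."""
    s_src, s_tgt = supp(itemset, source), supp(itemset, target)
    if s_src == 0:
        return 0 if s_tgt == 0 else inf
    return s_tgt / s_src


print(growth_rate({1, 9}, P, N))   # {1, 9} occurs once in each class
print(growth_rate({1, 7}, P, N))   # {1, 7} never occurs in P
```

Because {1, 9} occurs in one of the nine instances of the first class and one of the five instances of the second, its growth rate is (1/5)/(1/9) = 9/5.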

3 Classification by Aggregating EPs 

We now describe the major ideas and components of the CAEP classifier: (1) how
to partition the dataset to derive the EPs for use in CAEP, (2) how individual
EPs can differentiate class memberships, (3) how to combine the contribution
of individual EPs to derive the aggregate scores, and (4) how to normalize the
aggregate scores for deciding class membership. We also give an overview on how
to construct and use CAEP.



3.1 Partitioning Dataset to Get EPs of Classes 

For each class $C_k$, we will use a set of EPs to contrast its instances, $\mathcal{D}_k$, against all
other instances: We let $\mathcal{D}'_k = \mathcal{D} - \mathcal{D}_k$ be the opposing class, or simply opponent,
of $\mathcal{D}_k$. We then mine (discussion on how is given later) the EPs from $\mathcal{D}'_k$ to $\mathcal{D}_k$;
we refer to these EPs as the EPs of class $C_k$, and sometimes refer to $C_k$ as the
target class of these EPs.

For Example 1, some EPs of class $\mathcal{N}$ (i.e. from $\mathcal{P}$ to $\mathcal{N}$) are (e : {1}, $supp_{\mathcal{N}}(e)$ :
0.6, $growth\_rate_{\mathcal{P} \to \mathcal{N}}(e)$ : 2.7), ({1, 7}, 0.6, $\infty$), ({1, 10}, 0.4, 3.6), ({3, 4, 8, 9}, 0.2,
$\infty$). Similarly, some EPs of $\mathcal{P}$ are ({2}, 0.44, $\infty$), ({8}, 0.67, 3.33), ({4, 8}, 0.33,
1.67).

3.2 Differentiating Power of Individual EPs 

Each EP can sharply differentiate the class membership of a fraction of instances
which contain the EP, and this sharp differentiating power is derived from the
big difference between its supports in the opposing classes. Continuing with
Example 1, consider the EP ({1, 10}, 0.40, 3.60) for class $\mathcal{N}$. Suppose s is an
instance containing this EP. What are the odds that s belongs to $\mathcal{N}$, given that s
contains this EP? To simplify the discussion, we assume all classes have roughly
equal population counts; then the answer is
$\frac{supp_{\mathcal{N}}}{supp_{\mathcal{N}} + supp_{\mathcal{P}}} = \frac{3.6 * supp_{\mathcal{P}}}{3.6 * supp_{\mathcal{P}} + supp_{\mathcal{P}}} = \frac{3.6}{4.6} = 78\%$,
since $supp_{\mathcal{N}} = 3.60 * supp_{\mathcal{P}}$. Without this assumption, we need to
replace supports (e.g. $supp_{\mathcal{N}}$) by counts (e.g. $supp_{\mathcal{N}} * count_{\mathcal{N}}$, where $count_{\mathcal{N}}$ is
the number of instances of class $\mathcal{N}$), and similar odds can be obtained. Observe
that this EP has no differentiating power on instances s' that do not contain the
EP. So, assuming the population ratio in the training data accurately reflects
the ratio in test instances and all classes have roughly equal population counts,
this EP can differentiate the class membership with the probability of 78% for
roughly $0.5 * (1 + \frac{1}{3.6}) * supp_{\mathcal{N}} = 25\%$ of the total population.

The fraction of instances which contain an EP may be a very small fraction 
(25% above, but much smaller, e.g. 3%, in many examples) of all instances. 
Hence, it cannot yield very accurate predictions if it is used by itself on all 
instances. For example, if we applied the above EP on all instances, we would 
arrive at an overall predictive accuracy of roughly 0.25 * 0.78 = 19.5%. This 
would be much lower if coverage is only 3%. 
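The arithmetic of this subsection can be reproduced with a few lines of Python. This is a sketch under the same simplifying assumption as the text (two classes with equal population counts); the function names `odds` and `coverage` are introduced here for illustration only.

```python
def odds(growth_rate):
    """Chance that an instance containing the EP belongs to the target class,
    assuming equally populated classes: growth_rate / (growth_rate + 1)."""
    return growth_rate / (growth_rate + 1.0)


def coverage(growth_rate, supp_target):
    """Fraction of the whole (two-class, balanced) population that contains
    the EP: the average of its supports in the two classes."""
    supp_opponent = supp_target / growth_rate
    return 0.5 * (supp_target + supp_opponent)


gr, supp_n = 3.6, 0.40          # the EP ({1, 10}, 0.40, 3.60) from the text
p = odds(gr)
c = coverage(gr, supp_n)
print(f"odds={p:.3f} coverage={c:.3f} product={p * c:.3f}")
```

With unrounded intermediate values the product of odds and coverage comes out at about 20%, in line with the roughly 19.5% obtained in the text from the rounded figures 0.25 and 0.78.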



3.3 Better Overall Accuracy by Aggregated Score 

We noticed above that a single EP is sharp on predicting class membership of
a small fraction of instances, but not on all instances. We now show how to
combine the strength of a set of EPs in order to produce a classifier with good
overall accuracy.

Roughly speaking, given a test instance s, we let all the EPs of a class $C_i$ that
s contains contribute to the final decision of whether s should be labelled as $C_i$.
This gives us the advantage of covering more cases than each single EP can cover,
because different EPs complement each other in their applicable populations. To
illustrate, consider Example 1. The largest fraction of population that a single
EP (e.g. {8}) can cover is around 50%, whereas the seven EPs given above in
§3.1 have a much larger combined coverage, around 12/14 ≈ 85.7%.

How do we combine the differentiating power of a set of EPs? A natural way
is to sum the contributions of the individual EPs. (Other possibilities exist, but
are beyond the scope of this paper.) Now, how do we formulate the contribution
of a single EP? Roughly, we use a product of the odds discussed earlier and the
fraction of the population of the class that contain the EP. More specifically, let
e be an EP of class C; we let e’s contribution be given by
$\frac{growth\_rate(e)}{growth\_rate(e)+1} * supp_C(e)$.
Observe that the first term is roughly the conditional probability that
an instance is in class C given that the instance contains this EP e, and the
second term is the fraction of the instances of class C to which this EP applies.
The contribution increases with both $growth\_rate(e)$ and $supp_C(e)$. We now
define scores of instances for the classes.



Definition 2. Given an instance s and a set E(C) of EPs of a class C discovered
from the training data, the aggregate score (or score) of s for C is defined as

$$score(s, C) = \sum_{e \subseteq s,\; e \in E(C)} \frac{growth\_rate(e)}{growth\_rate(e) + 1} * supp_C(e).$$



We now illustrate the calculation of contributions of EPs and scores of
instances using Example 1 and the instance s = {1, 5, 7, 9}. Among EPs of the
growth rate threshold of 1.1, s contains 2 of class $\mathcal{P}$: ({5}, 44%, 1.11), ({1, 5, 9},
11%, $\infty$); it contains 10 of class $\mathcal{N}$: ({1}, 60%, 2.7), ({7}, 80%, 2.4), ({1, 5}, 20%,
1.8), ({1, 7}, 60%, $\infty$), ({1, 9}, 20%, 1.8), ({5, 7}, 40%, 1.8), ({7, 9}, 40%, 3.6),
({1, 5, 7}, 20%, $\infty$), ({1, 7, 9}, 20%, $\infty$), ({5, 7, 9}, 20%, 1.8). The aggregate score
of s for $\mathcal{P}$ is: $score(s, \mathcal{P}) = \frac{1.11}{1.11+1} * 0.44 + \frac{\infty}{\infty+1} * 0.11 = 0.52 * 0.44 + 1 * 0.11 = 0.33$.

Similarly, the contributions of the 10 EPs for $\mathcal{N}$ are respectively 0.41, 0.56, 0.12,
0.60, 0.12, 0.24, 0.31, 0.20, 0.20, 0.12, and their sum is $score(s, \mathcal{N}) = 2.88$.
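Definition 2 can be exercised on the EPs listed above. The sketch below hard-codes the (itemset, support, growth rate) triples exactly as quoted in the text rather than mining them; the names `contribution` and `score` are illustrative. Since it uses unrounded arithmetic, the totals it prints can differ slightly in the second decimal place from the figures in the text, which sum per-EP contributions that were already rounded.

```python
from math import inf

# (itemset, support in the target class, growth rate), as listed in the text.
EPS_P = [({5}, 0.44, 1.11), ({1, 5, 9}, 0.11, inf)]
EPS_N = [({1}, 0.60, 2.7), ({7}, 0.80, 2.4), ({1, 5}, 0.20, 1.8),
         ({1, 7}, 0.60, inf), ({1, 9}, 0.20, 1.8), ({5, 7}, 0.40, 1.8),
         ({7, 9}, 0.40, 3.6), ({1, 5, 7}, 0.20, inf),
         ({1, 7, 9}, 0.20, inf), ({5, 7, 9}, 0.20, 1.8)]


def contribution(supp, growth):
    """growth/(growth+1) * supp, using the limit value 1 for infinite growth."""
    weight = 1.0 if growth == inf else growth / (growth + 1.0)
    return weight * supp


def score(s, eps):
    """Sum the contributions of the EPs that are contained in instance s."""
    return sum(contribution(supp, gr) for e, supp, gr in eps if e <= s)


s = {1, 5, 7, 9}
print(round(score(s, EPS_P), 2), round(score(s, EPS_N), 2))
```

Either way the score of s for the second class is far larger than for the first, which is the point of the illustration.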



3.4 Normalizing the Scores to Make Decision 

For each instance s, how do we use the p scores for all classes to predict its class?

One might be tempted to assign to s the class label C for which the score of
s is the largest. This turns out to be a bad strategy. The main reason for this
is that the numbers of EPs for different classes may not be balanced, which is
a frequent scenario for applications where some classes may have more random
(uniform) distributions of values and consequently fewer EPs. If a class C has
many more EPs than another class C', then instances usually get higher scores
for C than for C', even for training instances of class C'. This indeed happens,
for example in the rice-DNA dataset (see §7), which consists of a positive class
and a negative class. The negative class contains mostly “random” instances,
and the ratio of the number of EPs of the positive to that of the negative is 28:1
when the support threshold is 3% and the growth rate threshold is 2.

Our solution to this problem is to “normalize” the scores, by dividing them
using a score at a fixed percentile for the training instances of each class. More
specifically, a base score for each class C, $basescore(C)$, should be first found
from the training instances of the class. The normalized score of an instance s for
C, $normscore(s, C)$, is defined as the ratio $score(s, C) / basescore(C)$. (Observe
that our use of the term “normalized” is a slight abuse, since our normalized
scores may be > 1.) Instead of letting the class with the highest raw score win,
we let the class with the largest normalized score win. (We break ties by letting
the class with the largest population win.)

How do we determine the base scores? We can let $basescore(C)$ be the
median of the scores of the training instances of class C; that is, exactly 50% of the
training instances of C have scores larger than or equal to $basescore(C)$. We do
not have to use 50%; in fact, other percentiles between 50% and 85% give roughly
similar results. The CAEP construction program should automatically choose
a good percentile in this range, by testing the performance of the constructed
classifier on the training instances. We do not want to use percentiles at the
two extreme ends (e.g. 3%), because the training instances usually contain some
outliers, and such a choice would give the outliers too much influence.

Example 2. For a simple illustration of the decision process, assume there are
5 training instances from each of the positive (+ve) and negative (-ve) classes;
assume the positive scores of the positive instances are 14.52, 15.28, 15.76, 16.65,
18.44, and the negative scores of the negatives are 4.8, 4.97, 5.40, 5.47, 5.51. The
(median) base scores for the positive and negative classes are respectively 15.76
and 5.4. Given a test instance s (known to be from the negative class) with
scores 7.07 and 4.82 for the positive and negative classes respectively, we have
$normscore(s, +ve) = 7.07/15.76 = 0.45$ and $normscore(s, -ve) = 4.82/5.4 = 0.89$.
s is thus labelled as negative.
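The decision of Example 2 can be reproduced with a short sketch. It assumes the median as the base-score percentile, as in the example, and leaves out the tie-breaking rule; the names `base_scores` and `classify` are illustrative.

```python
from statistics import median

# Training scores from Example 2: each class's scores for its own instances.
TRAIN_SCORES = {
    "+ve": [14.52, 15.28, 15.76, 16.65, 18.44],
    "-ve": [4.8, 4.97, 5.40, 5.47, 5.51],
}


def base_scores(train_scores):
    """Use the median (50th percentile) of each class as its base score."""
    return {c: median(v) for c, v in train_scores.items()}


def classify(raw_scores, base):
    """Return the class with the largest normalized score, plus all scores."""
    norm = {c: raw_scores[c] / base[c] for c in raw_scores}
    return max(norm, key=norm.get), norm


BASE = base_scores(TRAIN_SCORES)
label, norm = classify({"+ve": 7.07, "-ve": 4.82}, BASE)
print(label, {c: round(v, 2) for c, v in norm.items()})
```

Note that picking the largest raw score would have labelled s as positive (7.07 against 4.82); it is the division by the per-class base score that reverses the decision.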



3.5 The Entire Process 

The entire process for building and using CAEP is summarized below, assuming 
that the original dataset is partitioned according to the class labels. 

CAEP (training datasets $\mathcal{D}_1, \ldots, \mathcal{D}_p$ for p classes $C_1, \ldots, C_p$)

;; training phase

1) Mine the EP set $E_i$ from $\mathcal{D} - \mathcal{D}_i$ to $\mathcal{D}_i$ for each $1 \leq i \leq p$;

;; A growth rate threshold is given, or set to a default e.g. 2

2) Optionally, reduce the number of EPs in each of $E_1, \ldots, E_p$;

3) Calculate the aggregate scores of all training instances for all classes;

4) Get the base scores $basescore(C_i)$ for each class $C_i$;

;; testing phase

5) For each test instance s do:

6) Calculate aggregate and normalized scores of s for each class $C_i$;

7) Assign to s the class $C_j$ for which s has the largest normalized score.
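The steps above can be sketched end to end in Python for a dataset as small as Example 1. This is an illustrative toy, not the paper's implementation: EPs are mined by brute-force enumeration instead of the border-based algorithms, the optional reduction of step 2 is skipped, the base score is fixed at the median, and ties are not broken by population. All function names are hypothetical.

```python
from itertools import combinations
from math import inf
from statistics import median

# Example 1 data (binary-encoded instances as item sets).
DATA = {
    "P": [{2, 6, 7, 10}, {3, 5, 7, 10}, {3, 4, 8, 10}, {2, 4, 8, 9},
          {1, 4, 8, 10}, {3, 5, 8, 10}, {1, 5, 8, 9}, {2, 5, 7, 9},
          {2, 6, 8, 10}],
    "N": [{1, 6, 7, 10}, {1, 6, 7, 9}, {3, 4, 8, 9}, {1, 5, 7, 10},
          {3, 5, 7, 9}],
}


def supp(itemset, data):
    """Fraction of instances in `data` containing `itemset`."""
    return sum(1 for t in data if itemset <= t) / len(data)


def mine_eps(target, opponent, rho):
    """Step 1, by brute force: test every sub-itemset of a target instance."""
    candidates = {frozenset(c) for t in target
                  for k in range(1, len(t) + 1)
                  for c in combinations(sorted(t), k)}
    eps = []
    for e in candidates:
        s_tgt, s_opp = supp(e, target), supp(e, opponent)
        growth = inf if s_opp == 0 else s_tgt / s_opp
        if growth >= rho:
            eps.append((e, s_tgt, growth))
    return eps


def score(s, eps):
    """Aggregate score of Definition 2 (weight 1 for infinite growth)."""
    return sum(sp * (1.0 if gr == inf else gr / (gr + 1.0))
               for e, sp, gr in eps if e <= s)


def train(datasets, rho=2.0):
    """Steps 1, 3 and 4 (the optional reduction of step 2 is skipped)."""
    model = {}
    for label, target in datasets.items():
        opponent = [t for other, d in datasets.items()
                    if other != label for t in d]
        eps = mine_eps(target, opponent, rho)
        base = median(score(t, eps) for t in target)  # 50th percentile
        model[label] = (eps, base)
    return model


def predict(s, model):
    """Steps 6 and 7: the class with the largest normalized score wins."""
    norm = {c: (score(s, eps) / base if base > 0 else 0.0)
            for c, (eps, base) in model.items()}
    return max(norm, key=norm.get)


MODEL = train(DATA)
print(predict({1, 6, 7, 10}, MODEL), predict({2, 6, 8, 10}, MODEL))
```

Brute-force enumeration is exponential in the instance length and is only workable here because each instance has four items; for realistic data the mining step would be replaced by the methods of §4.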

4 Efficient Mining of EPs 

For the discovery of EPs we will be using methods introduced in [?]. A key
tool used by the efficient methods of [?] is that of borders, useful for the concise
representation and efficient manipulation of large collections of itemsets.

Example 3. An example border is $\langle \mathcal{L} = \{\{22\}, \{57\}, \{61\}\}, \mathcal{R} = \{\{22, 34, 36,
57, 61, 81, 85, 88\}\} \rangle$. The collection of itemsets represented by this border is
$\{Y \mid \exists X \in \mathcal{L}, \exists Z \in \mathcal{R} \text{ such that } X \subseteq Y \subseteq Z\}$. Representative itemsets covered
in the border include {22}, {22, 57}, {36, 57, 81, 88}. For interested readers, this
is actually a border for the EPs from the edible to the poisonous class, of an
encoding of the Mushroom dataset, at support threshold $\delta$ = 40% in the poisonous
and growth rate threshold $\rho$ = 2.
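The set represented by a border can be tested for membership without enumerating it. The sketch below checks the condition from Example 3 directly; `in_border`, `LEFT`, and `RIGHT` are names chosen here for illustration.

```python
def in_border(itemset, left, right):
    """True if some X in `left` and some Z in `right` give X <= itemset <= Z."""
    return (any(x <= itemset for x in left)
            and any(itemset <= z for z in right))


# The border of Example 3.
LEFT = [{22}, {57}, {61}]
RIGHT = [{22, 34, 36, 57, 61, 81, 85, 88}]

for y in ({22}, {22, 57}, {36, 57, 81, 88}, {34, 36}, {22, 99}):
    print(sorted(y), in_border(y, LEFT, RIGHT))
```

The last two itemsets are outside the border: {34, 36} contains none of the left-hand itemsets, and {22, 99} is not contained in the right-hand itemset.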

To calculate the aggregate score contributed by all the EPs (meeting some
thresholds) of a class $C_i$, we need to (i) find the EPs of $C_i$ and (ii) discover their
supports and growth rates. We now list two possible methods:

The large-border based approach: Max-Miner [?] is first used to efficiently
discover the border of the large itemsets from $\mathcal{D}_i$. (Such a border is called a
large border, hence the word “large” in the title of this approach.) If the large
itemsets represented by the border can be enumerated in memory, then with one
more scan of $\mathcal{D}_i$ and $\mathcal{D}'_i$ we can get the supports and growth rates of the EPs
of $C_i$. If it can be applied, this approach can discover all EPs whose supports in
$\mathcal{D}_i$ are larger than the given support threshold. However, some large
borders may represent “exponentially” many candidate itemsets, so that only a small
portion of these candidates can be held in memory; in that case we need to use the
next approach.

The border differential based approach: We first use Max-Miner [?] to discover
the two large borders of the large itemsets in $\mathcal{D}_i$ and the opponent $\mathcal{D}'_i$ having
certain support thresholds. Then we use the MBD-LLborder (multiple-border
differential) algorithm of [?] to find all the EP borders. Finally, we enumerate
the EPs contained in the EP borders, and go through $\mathcal{D}_i$ and $\mathcal{D}'_i$ to check their
supports and growth rates. With 13 EP borders of the Mushroom dataset for
some support thresholds, using this approach we quickly found the supports and
growth rates of 4692 EPs. Since MBD-LLborder only finds EPs whose supports
in the second dataset are > one support threshold and in the first dataset are
< another support threshold, we need to apply this method multiple times on
multiple pairs of large borders, or combine it with the previous method, to
get the important EPs satisfying the given support and growth rate thresholds.
5 Reduction of EPs Used

Given a class C, we would like to find as many EPs as possible to give good 
coverage of the training instances; at the same time, we prefer EPs that have 
relatively large supports and growth rates, as these characteristics correspond to 
larger coverage and stronger differentiating power. Very often, many of the EPs 
can be removed without loss of too much accuracy, by exploiting relationships 
between the EPs. Reduction can increase understandability of the classifier, and 
it may even increase predictive accuracy. 

The reduction step is optional, and it should not be done if it leads to poor 
classification of the training instances. This is a training time decision. 

Our method to reduce the number of EPs uses these factors: the absolute
strength of EPs, the relationships between EPs, and the relative difference
between their supports and growth rates. We measure the absolute strength of EPs
using a new growth rate threshold $\rho'$, which should be larger than the growth
rate threshold $\rho$ for the EPs. The main idea is to select the strong EPs and
remove the weaker EPs which have strong close relatives. We will refer to the
selected EPs as the essential EPs.

To reduce the set of EPs, we first sort the mined EPs into a list E, in decreasing
order on (growth_rate, support). The set of essential EPs, essE, is initialized
to contain the first EP in E. For each next EP e in E we do 1 and then 2:

1. For each EP x in essE such that $e \subset x$, replace x by e if 1.a or 1.b is true:
1.a. $growth\_rate(e) \geq growth\_rate(x)$
1.b. $supp(e) \gg supp(x)$ and $growth\_rate(e) \geq \rho'$

2. Add e to essE if both 1.a and 1.b are false, and e is not a superset of any x
in essE.

We select EPs this way because: When condition 1.a is true, e definitely
covers more instances than x since $e \subset x$, and e has a differentiating
power at least as strong as that of x because its growth rate is at least as high. A
typical situation captured by condition 1.b is when x is an EP with growth rate
$\infty$ but a very small support,
whereas e is an EP whose growth rate is less than that of x but e has a much
larger support than x. In this case, we prefer to have e since it covers many more
cases than x and since it has a relatively high differentiating power already due
to its growth rate being at least $\rho'$. To illustrate this point, consider these
two EPs of the Iris-versicolor class from the Iris dataset [?]:

$e_1$ = ({1, 5, 11}, 3%, $\infty$)    $e_2$ = ({11}, 100%, 22.25)

$e_2$ is clearly more useful than $e_1$ for classification, since it covers 32 times more
instances and its associated odds, 95.7%, is also very near that of the other EP,
$e_1$. In our experiments, for 1.b, we set the default value of $\rho'$ to 20 and the
default interpretation of the condition “$supp(e) \gg supp(x)$” is $supp(e)/supp(x) \geq 30$.

These parameters can be tuned based on coverage on training instances. 
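The reduction procedure of this section can be sketched as follows. Several details are assumptions made for the sketch rather than statements of the paper: both comparisons are taken as "greater than or equal", "both 1.a and 1.b are false" is read as "e replaced no x in step 1", and duplicates that arise when one e replaces several x are collapsed. The function name `reduce_eps` and its parameters are illustrative.

```python
from math import inf


def reduce_eps(eps, rho_prime=20.0, supp_ratio=30.0):
    """eps: list of (itemset, support, growth_rate); returns essential EPs."""
    if not eps:
        return []
    # Decreasing order on (growth_rate, support).
    ranked = sorted(eps, key=lambda ep: (ep[2], ep[1]), reverse=True)
    ess = [ranked[0]]
    for e in ranked[1:]:
        e_items, e_supp, e_gr = e
        replaced = False
        for i, (x_items, x_supp, x_gr) in enumerate(ess):
            if e_items < x_items:  # step 1: e is a proper subset of x
                cond_a = e_gr >= x_gr
                cond_b = e_supp >= supp_ratio * x_supp and e_gr >= rho_prime
                if cond_a or cond_b:
                    ess[i] = e
                    replaced = True
        # Step 2: keep e only if it replaced nothing and extends no kept EP.
        is_superset = any(x_items <= e_items for x_items, _, _ in ess)
        if not replaced and not is_superset:
            ess.append(e)
    # One e may have replaced several x; keep a single copy of each EP.
    unique = []
    for ep in ess:
        if ep not in unique:
            unique.append(ep)
    return unique


IRIS = [(frozenset({1, 5, 11}), 0.03, inf), (frozenset({11}), 1.0, 22.25)]
print(reduce_eps(IRIS))
```

On the two Iris EPs quoted above, the low-support EP with infinite growth rate is replaced by its high-support subset, because the support ratio exceeds 30 and the growth rate 22.25 exceeds the default threshold of 20.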
6 Selection of Thresholds and Base Scores

To build a classifier from a training dataset, we need to select two thresholds 
(for support and for support growth rate), and a base score for each class. 

The selection of these can be done automatically with the guidance of the 
training data: We start with some default thresholds and percentiles for the base 
scores. Then a classifier is built and its performance on the training instances 
is found. We then let the program try several alternatives and see if significant 
improvements are made. The best choice is then selected. 

We observe from our experiments that the lower the support threshold $\delta$, the
higher the predictive accuracy the classifier achieves; and for each support
threshold, the higher the growth rate threshold, the higher the predictive accuracy
the classifier achieves. Once $\delta$ is lowered to 1%-3%, the classifier usually becomes
stable in predictive accuracy; if the growth rate threshold is then raised to the
highest possible (see discussion below), CAEP almost always has better predictive
accuracy than C4.5 and CBA. In our experiments reported later, $\delta$ is always
between 1% and 3%; the exact choice of $\delta$ is influenced by how much running
time is allowed for CAEP construction.

The growth rate threshold also has a strong effect on the quality of the
classifier produced. Our general principle is to (a) mine EPs with a small
initial growth rate threshold such as 2, and (b) automatically select a larger
final growth rate threshold guided by the coverage of selected EPs on the
training instances. Generally, if the growth rate threshold is too high, the
classifier would contain too few EPs and may have low accuracy because of poor
coverage of the training instances. (Coverage of a set of EPs is measured by
the number of zero scores it produces on the training instances: the fewer the
zero scores, the better the coverage.) On the other hand, if it can be done
without lowering the coverage of training instances, raising the growth rate
threshold would always result in a classifier with higher predictive accuracy.
Our experiments show that with a support threshold of 1%-3%, the UCI datasets
usually yield a huge number of EPs with growth rates from 1 to ∞ (Rice-DNA
data is an exception, see Section 7). The automatically chosen growth rate
threshold is usually around 15.
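
The coverage-guided choice of the final growth rate threshold can be sketched as follows. This is only a minimal illustration under stated assumptions, not the procedure used in the experiments: the input format (for each training instance, the growth rates of the EPs it contains), the candidate list, and the function names are all hypothetical, and per-class scoring is ignored.

```python
import math

def zero_scores(matches, threshold):
    """Count training instances that contain no EP with growth rate >= threshold."""
    return sum(1 for rates in matches if not any(r >= threshold for r in rates))

def select_growth_threshold(matches, candidates, initial=2.0):
    """Return the largest candidate threshold whose coverage is no worse
    than the coverage at the initial (mining) threshold."""
    baseline = zero_scores(matches, initial)
    best = initial
    for t in sorted(candidates):
        if t >= initial and zero_scores(matches, t) <= baseline:
            best = max(best, t)
    return best

# Hypothetical data: growth rates of the EPs contained in each training instance.
matches = [
    [2.5, 18.0, math.inf],
    [3.0, 16.0],
    [2.2, 40.0],
    [],  # not covered even at the initial threshold
]
chosen = select_growth_threshold(matches, candidates=[2, 5, 10, 15, 20, 30])
print(chosen)
```

In this toy input the second instance loses its last matching EP once the threshold exceeds 16, so larger candidates are rejected.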

7 Rice-DNA, Sensitivity, and Precision 

Our motivation for a more accurate measure of classifiers comes from the
rice-DNA dataset (available at
http://adenine.krdl.org.sg:8080/limsoon/kozak/rice), containing rice-DNA Kozak
sequences.

A genomic DNA is a string over the alphabet {A, C, T, G}. The context
surrounding the protein translation start site of a gene is called the Kozak
sequence. Correct identification of such start sites from a long genomic DNA
sequence can save a lot of labor and money in identifying genes on that
sequence. The start site is always the A-T-G sequence. The context surrounding
the A-T-G has been the most important information to distinguish real start
sites from non-start sites. A context is typically taken from up to 15 bases
upstream of A-T-G to 10 bases downstream. So a Kozak sequence, for the purpose
of this work, consists of 25 letters (excluding the A-T-G).

In the genomic DNA sequences, non-start sites (negative) overwhelm real
start sites (positive), typically at a ratio of 24:1 or more. So a distinctive
feature of the rice-DNA dataset is that the numbers of instances of the two
classes are very unbalanced. What makes the treatment of this dataset more
difficult is that the positives, which are more important in reality, are by
far the minority. With this very unbalanced dataset, even the just-say-no
classifier, which always predicts an instance to be negative, will have an
overall accuracy of 24/25 = 96%. Unfortunately, in that case not a single real
start site has been identified, which is against our aim of classification.

From this analysis we can see that, in evaluating a classification method,
more meaningful measures than accuracy or error rate on the whole dataset are
desirable. We will use a measure in terms of two parameters, namely sensitivity
and precision, for each class, which have long been used in the signals world
and in information retrieval.

Given N instances whose class is known to be C, for a classifier P, if P labels
N' instances as of class C, of which N_1 are indeed of class C, then N_1/N is
called P’s sensitivity on C, denoted sens(C), and N_1/N' is called P’s precision
on C, denoted prec(C). (For N' = 0, we define sens(C) = 0 and prec(C) = 0.)

The performance of the just-say-no classifier on the rice-DNA dataset is:
sens(+ve) = 0, prec(+ve) = 0, sens(-ve) = 24/24 = 100%, prec(-ve) = 24/25 =
96%.
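
As a small worked check of these definitions, the following sketch computes sensitivity and precision for the just-say-no classifier on a toy sample with the 24:1 class ratio mentioned above. The function name and the label strings are illustrative assumptions, and the code assumes N > 0 for the class being evaluated.

```python
def sens_prec(actual, predicted, cls):
    """Return (sensitivity, precision) of the predictions for class `cls`."""
    n = sum(1 for a in actual if a == cls)           # N: instances truly of class cls
    n_pred = sum(1 for p in predicted if p == cls)   # N': instances labelled cls
    n_ok = sum(1 for a, p in zip(actual, predicted) if a == cls and p == cls)  # N_1
    if n_pred == 0:
        return 0.0, 0.0  # convention from the text for N' = 0
    return n_ok / n, n_ok / n_pred

# Toy sample with 24 negatives per positive; the classifier always says "neg".
actual = ["neg"] * 24 + ["pos"]
predicted = ["neg"] * 25

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
pos = sens_prec(actual, predicted, "pos")
neg = sens_prec(actual, predicted, "neg")
print(accuracy, pos, neg)
```

The overall accuracy looks high while the sensitivity and precision on the positive class are both zero, which is the motivation for reporting the two measures per class.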



8 Experimental Results 



We compare CAEP with the state-of-the-art classifiers C4.5 and CBA. (CBA was
reported to beat C4.5.) Except for rice-DNA (see §7 for its description), the
datasets we use are from the UCI machine learning repository. Furthermore,
we test CAEP on datasets with a large number of records and where each record
is long, for which no results of C4.5 or CBA are known to us.



Table 1: Accuracy Comparison

                                                                    CAEP accuracy      #EPs/class
Dataset      #records  #attributes  #classes  C4.5 (disc.)  CBA     w/o red.  red.     w/o red.  red.
breast-w     699       10           2         96.1%         96.1%   97.85%    97.70%   7647      233
ionosphere   351       34           2         92.0%         92.1%   92.39%    91.41%   1813464   7906
iris         150       4            3         94.7%         92.9%   98.62%    100%     40        4
mushroom     8124      22           2         -             -       98.82%    98.93%   823649    2738
pima         768       8            2         72.5%         73.0%   96.23%    96.56%   1427      57
rice-DNA     15760     28           2         -             55.3%   70.87%    75.63%   30555     24449
sonar        208       60           2         72.2%         78.3%   83.90%    90%      651864    4608
tic-tac-toe  958       9            2         99.4%         100%    99.06%    96.87%   5707      682
vehicle      846       18           4         66.4%         68.8%   76.28%    76.17%   324048    14857
wine         178       13           3         92.1%         91.6%   99.38%    99.44%   175384    1859



CAEP: Classification by Aggregating Emerging Patterns

Table 2: Precision and Sensitivity of CAEP

                                     sensitivity         precision
Dataset      class (instance dist.)  w/o red.  red.      w/o red.  red.
breast-w     benign (66%)            96.72%    97.17%    100%      99.34%
             malignant (34%)         100%      98.75%    94.26%    95.00%
ionosphere   good (64%)              94.54%    92.9%     94.23%    94.15%
             bad (36%)               88.33%    88.65%    91.33%    87.68%
iris         Setosa (35%)            100%      100%      100%      100%
             Versicolour (31%)       100%      100%      96.33%    100%
             Virginica (34%)         96.00%    100%      100%      100%
mushroom     edible (52%)            99.43%    99.61%    98.32%    98.46%
             poisonous (48%)         98.16%    98.35%    99.38%    99.34%
pima         positive (29%)          95.52%    94.93%    92.27%    93.60%
             negative (71%)          96.54%    97.23%    98.18%    97.95%
rice-DNA     positive (4%)           77.01%    73.91%    9.58%     10.95%
             negative (96%)          70.62%    75.70%    98.71%    98.63%
sonar        R (46%)                 83.33%    93.33%    84.62%    87.32%
             M (54%)                 84.09%    87.27%    88.46%    94.29%
tic-tac-toe  positive (65%)          99.52%    96.01%    99.07%    99.21%
             negative (35%)          98.19%    98.49%    99.13%    93.12%
vehicle      van (24%)               48.68%    53.95%    64.22%    62.94%
             saab (25%)              65.62%    65.00%    59.23%    61.16%
             bus (27%)               95.40%    90.72%    89.01%    88.93%
             Opel (24%)              92.93%    93.06%    88.31%    90.54%
wine         class 1 (40%)           98.57%    100%      100%      98.75%
             class 2 (33%)           100%      100%      100%      100%
             class 3 (27%)           100%      98.00%    98.00%    100%



Discretization of continuous attributes is done by the entropy method,
using code from the MLC++ machine learning library. All the results are
obtained by 10-fold cross-validation. Table 1 compares the overall predictive
accuracy of CAEP, C4.5 (with discretization) and CBA. Dashes indicate that
results are unavailable. (A recent test of C4.5 on rice shows that it has
essentially 0% sensitivity and 0% precision; it is essentially the
“just-say-negative” classifier which claims that everything is negative.
Observe that this gives an accuracy of about 96%.) Columns 2, 3 and 4 describe
the datasets: the numbers of records, of attributes and of classes,
respectively. Rice-DNA and Mushroom are the most challenging datasets, having
both a large number of instances and high dimensionality. Observe that
datasets of 2, 3 and even more classes are included, and that CAEP performs
equally well. Columns 5 and 6 give the average predictive accuracy of C4.5 and
CBA over 10-fold cross-validation, and Columns 7 and 8 give that of CAEP,
before and after reduction. Columns 9 and 10 give the average number of EPs in
the classifier, before and after reduction. It can be seen that although the
number of EPs is dramatically reduced by the reduction process, there is no big
loss in predictive accuracy and often there is an increase in accuracy.

It takes the CAEP classifier almost no time to decide the class of an instance;
e.g. only 0.01 second for a classifier with 10000 EPs.

Table 2 gives a more detailed characterization of CAEP on the datasets;
sensitivity and precision on each class, before and after reduction, are listed.
It shows that CAEP generally has good sensitivity and precision on each class.

The positive sensitivity and precision on the rice-DNA dataset are better
than the best observed neural network (NN) results known to us.
Abstract. Attribute-oriented induction is a useful data mining method
that generalizes databases under an appropriate abstraction hierarchy
to extract meaningful knowledge. The hierarchy is well designed so as to
exclude meaningless rules from a particular point of view. However, there
may exist several ways of generalizing databases according to the user’s
intention. It is therefore important to provide a multi-layered abstraction
hierarchy under which several generalizations are possible and are well
controlled. In fact, too-general or too-specific databases are inappropriate
for mining algorithms to extract significant rules. From this viewpoint,
this paper proposes a generalization method based on an information
theoretical measure to select an appropriate abstraction hierarchy.
Furthermore, we present a system, called ITA (Information Theoretical
Abstraction), based on our method and attribute-oriented induction. We
perform some practical experiments in which ITA discovers meaningful
rules from a census database of the US Census Bureau and discuss the validity
of ITA based on the experimental results.



1 Introduction 

Since the late 1980’s, studies on Knowledge Discovery in Databases (KDD) have
received much attention. Briefly speaking, a KDD process can be divided
into four processes: (1) data selection, (2) data cleaning and pre-processing,
(3) data mining, and (4) interpretation and evaluation. The third process,
data mining, is especially considered as a central one for extracting useful
knowledge from large databases very efficiently, and many studies concentrate
on developing methods for it. However, it is also well known that these methods
often detect meaningless rules that do not meet the user’s intention. One of
the reasons seems to be that something irrelevant to the user’s intention still
remains in the data on which the mining process operates. The second process,
data cleaning and pre-processing, should be useful to exclude the irrelevant
data.

In the literature, some common techniques used in pre-processing
are described. For instance, a database query language like SQL specifies the
part of a database with which the mining process is concerned. Furthermore,
given an appropriate conceptual hierarchy, a notion of generalization of
databases has been introduced in the Attribute-Oriented Induction algorithm [5],
implemented in DBMiner, to generalize databases and to prevent KDD processes
from extracting meaningless rules.

However, the SQL approach is too strict, for the user must specify which part
is relevant to his/her mining problem. Moreover, in the case of generalization
of databases, users or system administrators are required to have good domain
knowledge to provide just one appropriate conceptual hierarchy before the
mining processes. It seems a hard task even for the user to meet these
requirements. For this reason, we consider that the following functions are
necessary to support the decision about relevance.

- To predict the user’s intention with a smaller number of queries and to focus
  on the important relationship or structure among data.

- To automatically select an abstraction that is adapted to the target that
  the user wants to discover and to generalize databases by the selected
  abstraction.

In particular, we consider in this paper the second problem, the automatic
selection of generalizations. Here generalization means the act of substituting
concepts at an abstract level for the attribute values in the database at the
concrete level. The generalization directly affects the quality and the
significance of the knowledge extracted from the generalized databases. This is
because, if some significant differences between data values are missed by the
generalization, we lose the chance to find a significant rule in the
generalized database.

Thus it is very important to have an appropriate generalization according
to the user’s intention. Although there may exist various ways to define the
notion of the user’s intention, we assume in this paper that the user intends
to have a more understandable decision tree whose accuracy is high. More
precisely, given a target class, the understandability and the accuracy are
measured by the number of nodes the decision tree has and by its error ratio
with respect to the target class, respectively. Generalization according to
such a criterion must select an appropriate abstraction so as not to increase
the error ratio of the decision tree while decreasing its size. The decision
tree after generalization based on an appropriate abstraction will be more
compact and have a classification ability approximately equivalent to the one
before generalization.

Although the final goal is to present a method to generate such a general- 
ization, it is generally understood as a hard task to synthesize a generalization 
hierarchy, as is known in the fields of natural language processing and informa- 
tion retrieval. For this reason, this paper tries to solve a problem of selecting an 
appropriate hierarchy among possible ones. More precisely, we consider the hier- 
archy as a layered abstraction, which is defined as a grouping of attribute values 
at concrete level, and propose a method that selects an appropriate grouping 
from such ones. In this paper, therefore, the problem of generalizing databases 
is regarded as the problem of selecting abstraction of attribute values. 

The method presented in this paper adopts an information theoretical
measure to select an appropriate abstraction and to control the generalization.
In principle, given a target class, that is, a grouping of tuples in a
relational database, an abstraction preserving the class distribution of the
attribute values is preferred and selected. If the attribute values share the
same or a similar class distribution, they can be considered not to have
significant differences with respect to the class. So they can be abstracted to
a single value at the abstract level. On the other hand, when the values have
distinguishable class distributions, the difference will be significant for
performing the classification in terms of attribute values. Hence, the
difference should not be disregarded in the abstraction. The classification
under an attribute before applying abstraction has many branches. An
appropriate abstraction for the attribute is effective for merging them into
one. Thus the size of the decision tree can be decreased by the abstraction.

Moreover, since the class distributions before and after the abstraction are
the same or almost the same, the database generalized according to the
abstraction has the same or almost the same ability to discriminate and
characterize target classes in terms of attribute values. For these reasons,
an appropriate abstraction as discussed in this paper will satisfy the user’s
intention of preferring a more compact decision tree whose error ratio is low.

To measure the difference of class distributions, it is shown that the same
measure as used in C4.5 can be adopted. Based on this observation, we
propose and develop a system ITA (Information-Theoretic Abstraction) based
on our method and Attribute-Oriented Induction.

In Section 2 we explain Attribute-Oriented Induction, and Section 3 describes
the principle of ITA. In Section 4, we evaluate ITA with some experiments on a
census database of the US Census Bureau, and Section 5 concludes this paper
with a summary.

2 Attribute-Oriented Induction 

2.1 Overview 

Attribute-Oriented Induction has been developed for mining knowledge from
databases, and is currently implemented in a KDD system named DBMiner.

The task of the induction method is to find two types of rules, characteristic
rules and discrimination rules, for certain target classes from a relational
database. A class is denoted by a certain value $v_i$ of an attribute $A_j$. If
a tuple $t$ has $v_i$ as its $A_j$-value, then $t$ is said to belong to the
class $v_i$. That is, the class can be interpreted as the collection of tuples
each of which has $v_i$ as its $A_j$-value. A characteristic rule and a
discrimination rule for a class represent necessary and sufficient conditions,
respectively, for a tuple to belong to the class.

Given a relational database and a target class, in order to obtain these rules
for the class, the database is transformed into a more general one. More
precisely, some attribute values are abstracted by replacing them with more
abstract ones. It should be noted that some different tuples might be
transformed into the same one by the generalization. That is, overlapping
tuples are generated. In this case, they are merged together in the resultant
database.

Such a generalization process is carried out according to conceptual
hierarchies that are provided by knowledge engineers or domain experts prior to
the process. A conceptual hierarchy represents a taxonomy of the values of an
attribute domain that are partially ordered according to an
abstract-to-specific relation. After the generalization process, characteristic
and discrimination rules are extracted from the generalized database. These
processes are summarized as the following algorithm:

Algorithm 2.1 (attribute-oriented induction)

input: (1) a relational database, (2) a target class (a target attribute value),
(3) conceptual hierarchies and (4) a generalization threshold value.
output: rules characterizing or discriminating the target class.

1. By some relational operations (e.g. projection and selection), extract a
   data set that is relevant to the target class.

2. Generalize the database according to the hierarchies:

   (a) Replace attribute values with more abstract ones in the hierarchies. If,
       for an attribute, a large number of attribute values still remain in
       the database and further generalization of them cannot be performed,
       it is considered a redundant attribute and is removed.

   (b) Merge overlapping tuples into one. Then, a special attribute vote can
       be added to the generalized database to record how many tuples in the
       original database are merged together as the result.

   (c) Repeat the above two steps until the number of tuples falls below the
       threshold value.

3. Extract characteristic or discrimination rules from the final database.



a) A student database

Name      Category   Major       Birth_Place  GPA
Anderson  M.A.       history     Vancouver    3.5
Bach      junior     math        Calgary      3.7
Carey     junior     literature  Edmonton     2.6
Fraser    M.S.       physics     Ottawa       3.9
Gupta     Ph.D.      math        Bombay       3.3
Hart      sophomore  chemistry   Richmond     2.7
Jackson   senior     computing   Victoria     3.5
Liu       Ph.D.      biology     Shanghai     3.4
Meyer     sophomore  music       Burnaby      2.9
Monk      Ph.D.      computing   Victoria     3.8
Wang      M.S.       statistics  Nanjing      3.2
Gaboury   M.A.       history     Quebec1      3.6
Jacoby    M.A.       literature  Quebec2      3.5
Cai       M.A.       literature  Quebec3      3.7
Tomas     M.A.       history     Vancouver    3.4
Han       M.A.       history     Calgary      3.3
Cersonct  M.A.       history     Vancouver    3.0
Wise      freshman   literature  Toronto      3.9

b) Conceptual hierarchies

{ computing, math, biology, statistics, physics } ⊂ science
{ music, history, literature } ⊂ art
{ science, art } ⊂ ANY(Major)
{ freshman, sophomore, junior, senior } ⊂ undergraduate
{ M.S., M.A., Ph.D. } ⊂ graduate
{ undergraduate, graduate } ⊂ ANY(Category)
{ Burnaby, Vancouver, Victoria, Richmond } ⊂ British_Columbia
{ Calgary, Edmonton } ⊂ Alberta
{ Ottawa, Toronto } ⊂ Ontario
{ Quebec1, Quebec2, Quebec3 } ⊂ Quebec
{ Bombay } ⊂ India
{ Shanghai, Nanjing } ⊂ China
{ China, India } ⊂ foreign
{ British_Columbia, Alberta, Ontario, Quebec } ⊂ Canada
{ foreign, Canada } ⊂ ANY(Birth_Place)
{ 2.0 - 2.9 } ⊂ average
{ 3.0 - 3.4 } ⊂ good
{ 3.5 - 4.0 } ⊂ excellent
{ average, good, excellent } ⊂ ANY(GPA)

c) A generalized student database

Target class  Major    Birth_Place  GPA        vote
graduate      art      Canada       excellent  4
graduate      science  Canada       excellent  2
graduate      science  foreign      good       3
graduate      art      Canada       good       3

Fig. 2.1. A generalization of student database
Example 2.1 Let us consider a student database [7] and conceptual hierarchies
shown in Figure 2.1. In the hierarchies, “{ ..., c_i, ... } ⊂ C” means that C is
an abstraction of c_i. For the target class “graduate” of the attribute “Category”
and the threshold value of 4, a generalized database can be obtained as shown
in the figure. For example, from the database, we can extract a discrimination
rule: if a student’s major is art, his birth place is Canada and he has an excellent
GPA, then he is a “graduate” student.
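
The generalization in Example 2.1 can be reproduced with a short sketch. It is a simplification under stated assumptions rather than the DBMiner implementation: instead of climbing the hierarchies one level at a time until the threshold is met, it maps each value directly to the abstraction level that appears in Fig. 2.1(c), drops the Name attribute, and merges overlapping tuples while counting votes. All identifiers are illustrative.

```python
from collections import Counter

# Student database of Fig. 2.1(a): (Name, Category, Major, Birth_Place, GPA)
STUDENTS = [
    ("Anderson", "M.A.", "history", "Vancouver", 3.5),
    ("Bach", "junior", "math", "Calgary", 3.7),
    ("Carey", "junior", "literature", "Edmonton", 2.6),
    ("Fraser", "M.S.", "physics", "Ottawa", 3.9),
    ("Gupta", "Ph.D.", "math", "Bombay", 3.3),
    ("Hart", "sophomore", "chemistry", "Richmond", 2.7),
    ("Jackson", "senior", "computing", "Victoria", 3.5),
    ("Liu", "Ph.D.", "biology", "Shanghai", 3.4),
    ("Meyer", "sophomore", "music", "Burnaby", 2.9),
    ("Monk", "Ph.D.", "computing", "Victoria", 3.8),
    ("Wang", "M.S.", "statistics", "Nanjing", 3.2),
    ("Gaboury", "M.A.", "history", "Quebec1", 3.6),
    ("Jacoby", "M.A.", "literature", "Quebec2", 3.5),
    ("Cai", "M.A.", "literature", "Quebec3", 3.7),
    ("Tomas", "M.A.", "history", "Vancouver", 3.4),
    ("Han", "M.A.", "history", "Calgary", 3.3),
    ("Cersonct", "M.A.", "history", "Vancouver", 3.0),
    ("Wise", "freshman", "literature", "Toronto", 3.9),
]

# Conceptual hierarchies of Fig. 2.1(b), flattened to the levels used in (c).
MAJOR = {m: "science" for m in ("computing", "math", "biology", "statistics", "physics")}
MAJOR.update({m: "art" for m in ("music", "history", "literature")})
GRADUATE = {"M.S.", "M.A.", "Ph.D."}
CANADA = {"Burnaby", "Vancouver", "Victoria", "Richmond", "Calgary", "Edmonton",
          "Ottawa", "Toronto", "Quebec1", "Quebec2", "Quebec3"}

def gpa_level(gpa):
    """Map a GPA to average (2.0-2.9), good (3.0-3.4) or excellent (3.5-4.0)."""
    return "average" if gpa < 3.0 else "good" if gpa < 3.5 else "excellent"

def generalize(students, target="graduate"):
    """Select tuples of the target class, abstract them, and merge with votes."""
    merged = Counter()
    for _name, category, major, place, gpa in students:
        cls = "graduate" if category in GRADUATE else "undergraduate"
        if cls != target:
            continue
        # Majors missing from the hierarchy (e.g. chemistry) fall back to "ANY".
        row = (MAJOR.get(major, "ANY"),
               "Canada" if place in CANADA else "foreign",
               gpa_level(gpa))
        merged[row] += 1  # the count plays the role of the vote attribute
    return merged

generalized = generalize(STUDENTS)
for row, vote in generalized.items():
    print(row, vote)
```

The four merged tuples and their votes agree with Fig. 2.1(c), and their number meets the threshold value of 4 used in the example.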

2.2 Problems 

As described above, discovered rules are highly dependent on given conceptual 
hierarchies. However, such dependence would cause the following problems: 

- By generalizing the original database, we might lose some important
  differences among attribute values that are relevant to our classification
  task. Therefore the generalization process should be controlled carefully to
  prevent such a loss of necessary information. However, the generalization
  threshold does not seem to work very well for this important purpose.

- In fact, since the user may have many intentions for the goal of knowledge
  discovery, there might exist several ways of generalizing the original
  database. However, a given conceptual hierarchy is assumed to have single
  inheritance (a tree structure). This implies that such a restricted hierarchy
  might be able to work only for one particular intention of the user.

To overcome these problems, the following section proposes a new
generalization control method in which an information theoretical measure is
adopted to evaluate the appropriateness of a generalization.

3 Information Theoretical Abstraction System 

In this section, we present a new generalization method ITA based on the notion
of information gain ratio and propose an attribute-oriented induction based on
it. We first explain the measure and discuss its properties.

3.1 Behavior of Information Gain Ratio 

Let a data set $S$ be a set of instances of a relational schema
$R(A_1, \ldots, A_m)$, where $A_k$ is an attribute. Furthermore, we assume that
the user specifies an assignment $C$ of class information to each tuple in $S$.
We regard it as a random variable with the probability

$$\Pr(C = c_i) = \mathit{freq}(c_i, S) / |S|,$$

where $|S|$ is the cardinality of $S$, and $\mathit{freq}(c_i, S)$ denotes the
number of all tuples in $S$ whose class is $c_i$. Then the entropy of the class
distribution $(\Pr(C = c_1), \ldots, \Pr(C = c_n))$ over $S$ is given by

$$H(C) = -\sum_{i=1}^{n} \Pr(C = c_i) \log_2 \Pr(C = c_i). \qquad (3.1)$$

Now, given an attribute value $a_j$ of an attribute $A$, we obtain a posterior
class distribution $(\Pr(C = c_1 \mid A = a_j), \ldots, \Pr(C = c_n \mid A = a_j))$
that has the corresponding entropy

$$H(C \mid A = a_j) = -\sum_{i=1}^{n} \Pr(C = c_i \mid A = a_j) \log_2 \Pr(C = c_i \mid A = a_j),$$

where $A$ is also regarded as a random variable with the probability
$\Pr(A = a_j) = \mathit{freq}(a_j, S) / |S|$ using its frequency. The
expectation of these entropies of posterior class distributions is called the
conditional entropy $H(C \mid A)$:

$$H(C \mid A) = \sum_{j=1}^{\ell} \Pr(A = a_j) H(C \mid A = a_j). \qquad (3.2)$$

Subtracting equation (3.2) from equation (3.1) gives the information gain,
which is also called the mutual information $I(C; A)$:

$$\mathit{gain}(A, S) = H(C) - H(C \mid A) = I(C; A). \qquad (3.3)$$

To normalize the information gain, the entropy of an attribute
$A = \{a_1, \ldots, a_\ell\}$, called the split information, is used:

$$\mathit{split\_info}(A, S) = H(A) = -\sum_{j=1}^{\ell} \Pr(A = a_j) \log_2 \Pr(A = a_j). \qquad (3.4)$$

Finally, the information gain is divided by the split information. The
normalized information gain is called the information gain ratio and is given
by the following formula:

$$\mathit{gain\_ratio}(A, S) = \mathit{gain}(A, S) / \mathit{split\_info}(A, S) = I(C; A) / H(A). \qquad (3.5)$$

The information gain ratio is defined in terms of the posterior class
distributions given attribute values. Thus it depends on how the prior class
distribution $(\Pr(C = c_1), \ldots, \Pr(C = c_n))$ changes to the posterior
ones $(\Pr(C = c_1 \mid A = a_k), \ldots, \Pr(C = c_n \mid A = a_k))$. As the
posterior class distributions show higher probabilities (frequencies) for some
particular classes, the measure becomes higher. Conversely, if the posterior
class distribution is nearly uniform over the classes, the measure shows a
lower value.
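
The quantities in equations (3.1) to (3.5) can be computed directly from value frequencies. The sketch below is illustrative only: the function names and the toy columns are assumptions, and returning 0 when the split information is 0 is a convention chosen here, not one stated in the text.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Entropy of the empirical distribution of `values`, as in (3.1) and (3.4)."""
    n = len(values)
    return -sum((k / n) * log2(k / n) for k in Counter(values).values())

def conditional_entropy(classes, attr):
    """H(C|A): expected entropy of the class within each attribute value, as in (3.2)."""
    n = len(classes)
    total = 0.0
    for a, k in Counter(attr).items():
        subset = [c for c, x in zip(classes, attr) if x == a]
        total += (k / n) * entropy(subset)
    return total

def gain_ratio(classes, attr):
    """Information gain (3.3) divided by split information (3.4), as in (3.5)."""
    gain = entropy(classes) - conditional_entropy(classes, attr)
    split = entropy(attr)
    return gain / split if split > 0 else 0.0  # assumed convention for H(A) = 0

# Hypothetical toy data: one class label and two candidate attributes per tuple.
classes = ["g", "g", "u", "u"]
informative = ["x", "x", "y", "y"]    # posterior distributions are concentrated
uninformative = ["x", "y", "x", "y"]  # posterior distributions stay uniform

print(gain_ratio(classes, informative), gain_ratio(classes, uninformative))
```

The first attribute separates the classes completely and gets the higher gain ratio, while the second leaves the class distribution uniform within each value and gets the lower one.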

3.2 Preservation of Class Distribution Using Information Gain 
Ratio 

The class distribution has been considered as a useful feature for classification. 
This paper also uses it for generalization by an abstraction of attribute values. 
The principle is simply stated as follows: 

If some attribute values $a_1, \ldots, a_m$ of an attribute
$A = \{a_1, \ldots, a_\ell\}$ ($m \le \ell$) share the same or a similar
posterior class distribution
$(\Pr(C = c_1 \mid A = a_j), \ldots, \Pr(C = c_n \mid A = a_j))$, an "abstract
class distribution", defined as

$$(\Pr(C = c_1 \mid A \in \{a_1, \ldots, a_m\}), \ldots, \Pr(C = c_n \mid A \in \{a_1, \ldots, a_m\}))$$
$$= \Bigl(\sum_{j=1}^{m} \lambda_j \Pr(C = c_1 \mid A = a_j), \ldots, \sum_{j=1}^{m} \lambda_j \Pr(C = c_n \mid A = a_j)\Bigr),$$

also shows the same or a similar class distribution, where
$\lambda_j = \Pr(A = a_j) / \sum_{i=1}^{m} \Pr(A = a_i)$. Thus, an abstraction
identifying these $a_1, \ldots, a_m$ keeps the necessary information about
classes, so they can be abstracted to a single abstract value. In other words,
we consider only abstractions that preserve the class distribution.

Although there exist several ways to define the similarity between
distributions, we use the information gain ratio to approximate it. The point
is the change of the information gain ratio when the attribute values
$a_1, \ldots, a_m$ of $A$ are abstracted to a single value. An abstraction is
considered as a mapping $f : A \to f(A)$, where $f(A)$ is an attribute at the
abstract level. The abstraction of $a_1, \ldots, a_m$ is now written as
$f(a_j) = \tilde{a}_k \in f(A)$ ($1 \le j \le m$). $f(A)$ can be regarded as a
random variable with

$$\Pr(f(A) = \tilde{a}_k) = \Pr(A \in f^{-1}(\tilde{a}_k)) = \sum_{a_i \in f^{-1}(\tilde{a}_k)} \Pr(A = a_i).$$

Now consider the posterior class distributions at the abstract level

$$(\Pr(C = c_1 \mid f(A) = \tilde{a}_j), \ldots, \Pr(C = c_n \mid f(A) = \tilde{a}_j)) \quad (1 \le j \le |f(A)|)$$

and the corresponding conditional entropy $H(C \mid f(A))$. In general,
according to the basic theorem of entropy and the data-processing theorem,
described in the literature [13], the following inequalities hold between
$H(C \mid A)$ and $H(C \mid f(A))$:

$$H(C \mid A) \le H(C \mid f(A)), \qquad I(C; A) \ge I(C; f(A)).$$

Hence the information gain does not increase after the abstraction of
attribute values. More precisely, the difference
$e(f) = H(C \mid f(A)) - H(C \mid A)$ is calculated by the following formula:

$$e(f) = \sum_{j=1}^{|f(A)|} \Pr(f(A) = \tilde{a}_j) \sum_{i=1}^{n} D(c_i, \tilde{a}_j),$$

$$D(c_i, \tilde{a}_j) = L\Bigl(\sum_{k=1}^{m_j} \lambda_{jk} \Pr(C = c_i \mid A = a_{jk})\Bigr) - \sum_{k=1}^{m_j} \lambda_{jk} L(\Pr(C = c_i \mid A = a_{jk})),$$

where $L(x) = -x \log_2 x$ is defined on $[0, 1]$,
$f^{-1}(\tilde{a}_j) = \{a_{j1}, \ldots, a_{jm_j}\}$, and
$\lambda_{jq} = \Pr(A = a_{jq}) / \sum_{k=1}^{m_j} \Pr(A = a_{jk})$.

The subtraction $D(c_i, \tilde{a}_j)$ is the difference between the $L$-value
of the mean of $\Pr(c_i \mid a_{jk})$ and the mean of the $L$-values of
$\Pr(c_i \mid a_{jk})$, where the mean is taken with respect to the
distribution $(\lambda_{j1}, \ldots, \lambda_{jm_j})$. Thus, as the posterior
probabilities $\Pr(c_i \mid a_{jk})$ for $k \in \{1, \ldots, m_j\}$ lie within
a smaller range of values, $D(c_i, \tilde{a}_j)$ tends to become smaller, since
the mean $\sum_k \lambda_{jk} \Pr(c_i \mid a_{jk})$ also lies within that range
of values. For the same reason, the difference between
$\sum_k \lambda_{jk} \Pr(c_i \mid a_{jk})$ (that is, the posterior class
distribution after applying abstraction) and $\Pr(c_i \mid a_{jk})$ for
$1 \le k \le m_j$ (that is, the one before applying abstraction) tends to
become smaller as well. In particular, if the posterior class distributions
before the abstraction,
$(\Pr(c_1 \mid a_{jk}), \ldots, \Pr(c_n \mid a_{jk}))$ for $k = 1, \ldots, m_j$,
are all identical, then the difference $e(f)$ becomes 0. That is, all the class
distributions before and after the abstraction are the same. On the other
hand, as the posterior probabilities $\Pr(c_i \mid a_{jk})$ for
$k \in \{1, \ldots, m_j\}$ lie within a wider range of values, the difference
$e(f)$ tends to become larger.
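
The behaviour of the difference between the two conditional entropies can be checked numerically. The sketch below computes it directly as H(C|f(A)) - H(C|A) rather than through the D terms; the toy data and the two mappings are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(values):
    n = len(values)
    return -sum((k / n) * log2(k / n) for k in Counter(values).values())

def conditional_entropy(classes, attr):
    """H(C|A) estimated from paired lists of class labels and attribute values."""
    n = len(classes)
    return sum(
        (k / n) * entropy([c for c, x in zip(classes, attr) if x == a])
        for a, k in Counter(attr).items()
    )

def abstraction_loss(classes, attr, f):
    """e(f) = H(C|f(A)) - H(C|A) for a mapping f from concrete to abstract values."""
    abstract = [f[a] for a in attr]
    return conditional_entropy(classes, abstract) - conditional_entropy(classes, attr)

# Hypothetical data: values a and b share one class distribution, c differs.
classes = ["g", "u", "g", "u", "g", "g"]
attr = ["a", "a", "b", "b", "c", "c"]

merge_ab = {"a": "ab", "b": "ab", "c": "c"}  # merges identical distributions
merge_bc = {"a": "a", "b": "bc", "c": "bc"}  # merges different distributions

loss_ab = abstraction_loss(classes, attr, merge_ab)
loss_bc = abstraction_loss(classes, attr, merge_bc)
print(loss_ab, loss_bc)
```

Merging the two values with identical posterior class distributions loses nothing, whereas merging values whose distributions differ gives a strictly positive loss.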
From these observations, we propose to adopt the information gain as a
measure to decide whether the posterior class distributions before the
abstraction are almost the same (that is, similar) or not. More concretely, we
consider an abstraction mapping $f$ with higher $I(C; f(A))$ to be preferable.

We furthermore use the split information $H(f(A))$ to compare two or more
preferable abstractions. According to entropy theory,
$H(A) \ge H(f_1(A)) \ge H(f_2(A))$ holds, provided $f_1$ is a refinement of
$f_2$ (that is, $f_2(a_1) = f_2(a_2)$ whenever $f_1(a_1) = f_1(a_2)$). Hence,
since $I(C; f(A))$ is divided by $H(f(A))$, the information gain ratio tends to
favor an abstraction $f$ that identifies a larger number of attribute values
with an abstract value. Under such a mapping $f$, we will obtain a simpler
generalized database than under others. This will help classifiers to perform
their classification tasks. This is the reason why we adopt the information
gain ratio, just as in the case of C4.5, to compare several abstractions.



3.3 Information Theoretical Abstraction Algorithm 

Based on the above discussion, we present here an algorithm of Information 
Theoretical Abstraction. In the algorithm, the term “change ratio” means the 
ratio of the information gain ratio after applying generalization to one before 
applying generalization. 

Algorithm 3.1 (information theoretical abstraction)

input: (1) a relational database, (2) a target class (a target attribute value),
(3) conceptual hierarchies and (4) a threshold value of the change ratio.
output: a generalized database.

1. Select an attribute from the database before applying generalization and
   compute the information gain ratio for the attribute.

2. Compute the information gain ratio for all abstractions in the hierarchies
   for the attribute and select an abstraction with the maximum information
   gain ratio.

3. Compute the change ratio. If it is above the threshold value, substitute
   abstract values in the abstraction for the attribute values in the database.
   Otherwise the attribute is removed.

4. Merge overlapping tuples into one, count the number of merged tuples and
   record it in the attribute vote.

5. Repeat the above four steps for all attributes.

By replacing the generalization process in the Attribute-Oriented Induction
algorithm (Algorithm 2.1) with our algorithm, we obtain a new KDD system,
ITA, that appropriately controls the generalization process in
attribute-oriented induction. Since ITA can automatically select an appropriate
abstraction hierarchy and ascend to an appropriate level of the hierarchy, it
does not suffer from the problems pointed out in Section 2.
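
Steps 1 to 4 of Algorithm 3.1 for a single attribute can be sketched as below. This is an illustration under stated assumptions, not the ITA implementation: each candidate abstraction is given as a plain mapping from concrete to abstract values, the toy data and names are hypothetical, and a change ratio of 0 is assumed when the gain ratio before generalization is 0.

```python
from collections import Counter
from math import log2

def entropy(values):
    n = len(values)
    return -sum((k / n) * log2(k / n) for k in Counter(values).values())

def gain_ratio(classes, column):
    """Information gain of the class with respect to `column`, divided by H(column)."""
    n = len(classes)
    cond = sum(
        (k / n) * entropy([c for c, x in zip(classes, column) if x == a])
        for a, k in Counter(column).items()
    )
    split = entropy(column)
    return (entropy(classes) - cond) / split if split > 0 else 0.0

def ita_step(classes, column, abstractions, threshold):
    """Pick the abstraction with maximum gain ratio; apply it if the change ratio
    exceeds the threshold, otherwise signal that the attribute is removed."""
    base = gain_ratio(classes, column)                       # step 1
    def ratio_of(name):
        return gain_ratio(classes, [abstractions[name][a] for a in column])
    best = max(abstractions, key=ratio_of)                   # step 2
    change = ratio_of(best) / base if base > 0 else 0.0      # step 3 (assumed convention)
    if change > threshold:
        generalized = [abstractions[best][a] for a in column]
        return best, Counter(zip(classes, generalized))      # step 4: merge and vote
    return None, None

# Hypothetical data: values a and b have the same class distribution.
classes = ["g", "u", "g", "u", "g", "g"]
column = ["a", "a", "b", "b", "c", "c"]
abstractions = {
    "merge_ab": {"a": "ab", "b": "ab", "c": "c"},
    "merge_bc": {"a": "a", "b": "bc", "c": "bc"},
}
chosen, merged = ita_step(classes, column, abstractions, threshold=1.0)
print(chosen, merged)
```

Here the abstraction that merges the two values with identical class distributions keeps the information gain while lowering the split information, so its gain ratio rises and the change ratio exceeds 1.0.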
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3.4 An Example of Knowledge Discovery by ITA 

As discussed above, based on the information gain ratio, we can select an ap>- 
propriate abstraction from a conceptual hierarchy with multiple inheritances. 
According the selected abstraction, we generalize a given database and try to 
extract meaningful knowledge from the generalized database. We show here an 
example of knowledge discovery by our ITA method. 

Example 3.1 Let us consider a conceptual hierarchy for an attribute Birth_ 
Place shown in Figure rd.ll The hierarchy involves multiple inheritances derived 
from two points of view: Canada&Foreign and Language. From the former 
viewpoint, the birth places are abstracted depending on whether they belong to 
Canada or not. From the latter, they are abstracted depending on their official 
languages. 



Table 3.1. Generalized database (viewpoint: Language) 

Class          Birth_Place  GPA        vote  support  confidence  Mark
graduate       Others       good       3     0.16667  1.00000
graduate       English      excellent  3     0.16667  0.50000     *0
graduate       French       excellent  3     0.16667  1.00000
graduate       English      good       3     0.16667  1.00000
undergraduate  English      average    3     0.16667  1.00000
undergraduate  English      excellent  3     0.16667  0.50000     *0

Under the conceptual hierarchy, we try to extract useful knowledge from the 
student database in Figure I 'Z.Ll that discriminate the target classes graduate 
and undergraduate of the attribute Category, where we assume the threshold 
value of change ratio is 1.0. 

Based on the result of computation of the information gain ratio, ITA con- 
siders the attribute art to be irrelevant to the classification task and removes 
it. Furthermore, from the viewpoint of Language, ITA selects an appropriate 
abstraction in the hierarchy and generalizes the original database by means 
of the abstraction. The obtained database is shown in Table 3.1. For example, 
from the first tuple of the generalized database, we can obtain a discrimination 
rule for the class graduate: 

    graduate ⇐ (Birth_Place = Others) ∧ (GPA = good) [sup: 0.167, con: 1.000]. 

We refer to the values of support (sup) and confidence (con) to evaluate the qual- 
ity of the rule. The support value shows the coverage of the rule in the original 
database and the confidence value shows the accuracy of the discovered rule. 
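
As a worked check, the support and confidence values of the Language-viewpoint generalized database can be recomputed from the vote counts. The formulas below are an assumption inferred from the tabulated values rather than definitions quoted from the text: support is taken as the vote of a tuple divided by the total number of votes, and confidence as the vote divided by the votes of all tuples that share the same condition part (Birth_Place, GPA) regardless of class. The function name `support_confidence` is hypothetical.

```python
from collections import defaultdict

# (class, Birth_Place, GPA, vote) for the Language viewpoint
table_3_1 = [
    ("graduate", "Others", "good", 3),
    ("graduate", "English", "excellent", 3),
    ("graduate", "French", "excellent", 3),
    ("graduate", "English", "good", 3),
    ("undergraduate", "English", "average", 3),
    ("undergraduate", "English", "excellent", 3),
]


def support_confidence(table):
    """Return (class, *condition, support, confidence) for every tuple."""
    total = sum(vote for *_, vote in table)
    votes_by_condition = defaultdict(int)
    for _cls, *condition, vote in table:
        votes_by_condition[tuple(condition)] += vote
    return [
        (cls, *condition, vote / total, vote / votes_by_condition[tuple(condition)])
        for cls, *condition, vote in table
    ]


for row in support_confidence(table_3_1):
    print(row)
```

Under these assumptions the two tuples with the condition (English, excellent) split their six votes evenly between the two classes, which is why their confidence is 0.5.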

The tuples marked “*0” in Table 3.1 have a low confidence value of 0.5. 
Therefore the rules extracted from them would be unreliable. On the other hand, 
rules extracted from the tuples without “*0” are expected to work reliably. 

Table 3.2. Generalized database (viewpoint: Canada&Foreign) 

Class          Birth_Place  GPA        vote  support  confidence  Mark
graduate       foreign      good       3     0.16667  1.00000
graduate       Canada       excellent  6     0.33333  0.66667     *0
graduate       Canada       good       3     0.16667  1.00000
undergraduate  Canada       average    3     0.16667  1.00000
undergraduate  Canada       excellent  3     0.16667  0.33333     *0


Let us assume that we select an abstraction from the viewpoint of 
Canada&Foreign. By means of the abstraction, the generalized database shown in 
Table 3.2 is obtained. From the tuples with “*0” in this database, two 
unreliable rules would be extracted. These tuples correspond to 9 tuples in the 
original database, while those with “*0” in Table 3.1 correspond to 6 tuples. 
That is, more tuples might be discriminated incorrectly by the rules extracted 
from this generalized database. Therefore the former abstraction, from the 
viewpoint of Language and selected by ITA, is considered more appropriate than 
the latter, from the viewpoint of Canada&Foreign. 

4 Experiments on a Census Database 

We have carried out experiments using our ITA system implemented in Visual 
C++ on a PC/AT. This section shows the experimental results and discusses 
the usefulness of the system. 

In our experiments, we try to discover meaningful knowledge from a Census 
Database of the US Census Bureau found in the UCI repository. The database 
consists of 32561 tuples, each of which has values for 15 attributes including 
age, marital status, hours_per_week, salary, etc. Apart from this database, a 
smaller database consisting of 15060 tuples is prepared in order to check the 
usefulness of discovered knowledge (it is referred to as test data). Concept 
hierarchies for the attribute values are constructed based on the electronic 
dictionary WordNet and are given to our system. It should be noted that there 
exist many multiple inheritances in the hierarchies. 

4.1 Discovery of Discrimination Rules 

Assuming that the target classes are “≤ $50K” (NOT more than $50000) and 
“> $50K” (more than $50000) of the attribute salary and that the threshold of 
the change ratio is 1.3, we try to discover discrimination rules for the classes. 

The obtained generalized census database is shown in Table 4.1. For the 
class “≤ $50K”, our ITA system discovered two rules: if a person is 
middle-aged and single and his working hours are middle, or if he is young and 
single and his working hours are short, then his salary is not more than 
$50,000. According to the support and confidence values, these rules cover 23% 
of the tuples in the original census database and have high confidence (0.92 
and 0.99). Therefore our system can be considered to have discovered useful 
discrimination rules for the class. 

On the other hand, the rules obtained for the class “> $50K” do not seem to 
be useful from the viewpoint of confidence. This implies that, in order to 
discover useful discrimination rules for the class, we should adjust the 
threshold value of the change ratio so that we can obtain more detailed rules. 



4.2 Discovery of Decision Tree from Generalized Database 

Our ITA system can appropriately generalize a given original database according 
to a given conceptual hierarchy. Since such a generalized database does not 
contain too detailed descriptions, we can expect to obtain useful decision trees 
from the generalized database that are more compact than ones obtained from 
the original database. For the census database, we compare here these decision 
trees from the viewpoints of the size and error rate. 

Basically, according to Algorithm 3.1, the generalized database is obtained 
by substituting the attribute values in the census database with the 
abstraction in the conceptual hierarchy. It should be noted that inadequate 
attributes and overlapping tuples remain in the resultant database. The 
generalized database is given to C4.5 in order to obtain a decision tree, which 
is referred to as a decision tree by “ITA+C4.5”. For several thresholds of the 
change ratio, we construct decision trees and compare them with the one 
obtained from the original census database by C4.5. The test data is used to 
evaluate their accuracy. The experimental results are shown in Table 4.2. 

Table 4.1. Discovered discrimination rules from the census database 

Class   age     marital  hours   vote  support  confidence  Mark
≤ $50K  middle  single   middle  4461  0.15     0.92        *0
≤ $50K  youth   single   short   2283  0.08     0.99        *5
> $50K  middle  married  long    2773  0.09     0.59        *7
> $50K  old     married  long    21    0.01     0.62        *22



Table 4.2. An estimation of the decision tree 

                            C4.5    ITA+C4.5
threshold                   -       0.00 - 0.54  0.55 - 0.76  0.77 - 1.08  1.09 - 1.10  1.11 - 1.46
decision tree size          982     93           101          65           101          313
error rate (the database)   13.9%   18.9%        17.3%        17.5%        17.2%        16.5%
error rate (the test data)  17.6%   19.1%        17.8%        17.9%        17.8%        17.3%

For all ranges of the threshold, C4.5 constructs better trees than ITA+C4.5 
with respect to the error rate for the original database. However, we cannot 
find any remarkable difference between the error rates for the test data. 
Therefore, the decision tree by C4.5 is considered to be too specific 
(overfitted) to the original database. On the other hand, ITA+C4.5 constructs 
decision trees for which the error rates are almost equal for the original 
database and the test data. Furthermore, compared with the size of the decision 
tree by C4.5, the sizes of those by ITA+C4.5 for thresholds from 0.00 to 1.10 
are quite small. The ratio is about 0.1. Therefore, ITA is considered very 
useful for decreasing the size of a decision tree while preserving accuracy. 
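
The two comparisons made above (tree size relative to C4.5, and the gap between the error rates on the test data and on the original database) can be recomputed directly from the reported figures. The snippet below is only an arithmetic aid; the dictionary layout and the name `summarize` are hypothetical, and the numbers are the ones reported in the experimental results.

```python
# (decision tree size, error % on the database, error % on the test data)
table_4_2 = {
    "C4.5": (982, 13.9, 17.6),
    "ITA+C4.5, 0.00-0.54": (93, 18.9, 19.1),
    "ITA+C4.5, 0.55-0.76": (101, 17.3, 17.8),
    "ITA+C4.5, 0.77-1.08": (65, 17.5, 17.9),
    "ITA+C4.5, 1.09-1.10": (101, 17.2, 17.8),
    "ITA+C4.5, 1.11-1.46": (313, 16.5, 17.3),
}


def summarize(table, baseline="C4.5"):
    """Return {name: (size ratio to baseline, test error minus database error)}."""
    base_size = table[baseline][0]
    return {
        name: (round(size / base_size, 2), round(err_test - err_db, 1))
        for name, (size, err_db, err_test) in table.items()
    }


for name, (ratio, gap) in summarize(table_4_2).items():
    print(f"{name:22s} size ratio {ratio:.2f}  error gap {gap:+.1f} points")
```

The gap for C4.5 alone is several times larger than for any ITA+C4.5 setting, which is the numerical basis of the overfitting remark.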

5 Conclusion 

In this paper, we proposed a KDD method, ITA, in which an adequate ab- 
straction in a given concept hierarchy is selected according to the information 
theoretical measure. 

As we discussed in the introduction, we assume that users prefer a more 
compact decision tree whose error rate is low. However, among such decision 
trees, we can easily find trivial ones that are kinds of default knowledge users 
already know. Since the generalized database based on an appropriate 
abstraction does not lose the necessary and sufficient information needed to 
perform the classification with respect to a given target class, the problem of 
detecting trivial ones is due to the classification method rather than to the 
abstraction. 

One possible way to resolve the problem seems to be to use the abstraction 
itself to detect trivial rules. As the decision trees become more compact, 
users have more chances to realize that the discovered rules are trivial. 

Once such a trivial rule is recognized, the next thing to do is to focus our 
consideration on the part of the database from which truly significant rules 
are derived. It should be noted here that such rules never hold for the 
original database, so they must be exceptional. We can therefore conclude that 
the combination of the abstraction method and techniques for discovering 
exceptional rules will lead us to a more powerful and meaningful mining 
algorithm. 

Abstract. We have developed an interactive production system architecture 
to simulate collaborative hypothesis testing processes, using Wason’s 2-4-6 
task. In interactively solving situations, two systems find a target by 
conducting experiments alternately. In independently solving situations, each 
of two systems finds a target without interaction. If the performance in the 
former situations exceeds that in the latter, we recognize “emergence”. The 
primary results obtained from computer simulations in which hypothesis testing 
strategies were controlled are as follows: (1) Generally speaking, 
collaboration neither provided the benefits of interaction nor caused emergence 
when only the experimental space was shared. (2) As the difference between the 
strategies became larger, the benefits of interaction increased. (3) The 
benefits came from complementary effects of interaction; that is, the 
disadvantage of one system that used an ineffective strategy was compensated 
for by the other system that used an advantageous strategy. In the few cases 
where we recognized emergence, the complementary interaction of the two 
systems brought a supplementary ability of disconfirmation. 



1 Introduction 

Research on human discovery processes has a long history. One of the major 
research themes in these studies is hypothesis testing. In psychological 
approaches, researchers have conducted psychological experiments using 
relatively simple experimental tasks. Bruner’s concept-learning task [1], 
Wason’s 2-4-6 task [2], and New Eleusis [3] are well-known examples of such 
tasks. By using these tasks, psychologists try to simulate human discovery 
processes in an experimental room and obtain empirical knowledge on human 
thinking and reasoning processes. 

Recently some researchers have begun to pay attention to collaborative dis- 
covery processes, using the same tasks that they have used for a long period. In 
this paper, we will investigate collaborative discovery processes based on com- 
puter simulations, using the same task that psychologists have used. By doing 
so, we will be able to compare the data of computer simulations in this paper 
with knowledge obtained through the psychological approaches. 

An important key concept of collaboration is “emergence”. The performance 
in the case of two systems solving a problem is sometimes better than in the 
case of one system solving a problem. However, we do not regard this 
improvement as an emergent phenomenon. 

Let us consider the following two situations. 

— System A and System B solve a problem independently without interaction. 
If at least one of them reaches the solution, then we judge the problem is 
solved independently. 

— System A and System B solve a problem while interacting with each other. 
If at least one of them reaches the solution, then we judge the problem is 
solved collaboratively, that is, the group (System A and System B) solves 
the problem together. 

The former corresponds to a situation in which two persons independently 
solve a problem in different rooms and then the answers of the two persons are 
compared. The latter corresponds to a situation in which two persons solve a 
problem while talking with each other. 

When the performance in the latter case exceeds that in the former case, we 
recognize emergence, because the improvement depends entirely on the 
interaction between the two systems. 
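
The criterion can be made concrete with a small sketch. It assumes that each trial is recorded as a Boolean (solved or not) for System A alone, System B alone, and the interacting pair; the function names and the toy data are hypothetical, and the sketch compares raw success rates only, without any statistical test of the difference.

```python
def success_rate(outcomes):
    """Fraction of trials that ended with a correct solution."""
    return sum(outcomes) / len(outcomes)


def independent_outcomes(a_solved, b_solved):
    """A trial counts as solved independently if at least one system solved it."""
    return [a or b for a, b in zip(a_solved, b_solved)]


def exceeds_independent(a_solved, b_solved, interactive_solved):
    """True when the interacting pair outperforms the independent pair."""
    baseline = success_rate(independent_outcomes(a_solved, b_solved))
    return success_rate(interactive_solved) > baseline


# Hypothetical outcomes of four trials.
a = [True, False, False, True]
b = [False, False, True, True]
print(success_rate(independent_outcomes(a, b)))            # independent pair
print(exceeds_independent(a, b, [True, True, True, True]))  # interacting pair
```

The baseline is deliberately the better-of-two independent result rather than a single system, so that an improvement cannot be explained merely by having two chances to reach the solution.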

There have been only a few psychological studies indicating the benefits of 
collaboration based on the criterion above. A rare case was reported by Okada & 
Simon (1997). The emergent phenomenon defined above has rarely been observed. 
In this study, two production systems interactively solve a discovery task. Each 
system forms hypotheses about a target rule and conducts experiments. Sys- 
tems revise hypotheses based on feedback from experiments. Let us consider two 
situations (see Figure 1). 



Fig. 1. Interactively solving situation and independently solving situation. 
Panels: (a) interactive situation; (b) independent situation. 



— Two systems carry out experiments independently, and the final results that 
the two systems reach are compared. In this case, there is no interaction 
between the two systems (independently solving situations). 

— Each system conducts experiments alternately (that is, the first experiment is 
carried out by System A, the second by System B, the third by System A, the 
fourth by System B, and so on), and receives the results of both self-conducted 
and other-conducted experiments. In this case, interaction between the two 
systems exists (interactively solving situations). 

Note that in the interactively solving situations above, the reference is 
restricted to accepting experimental results only; one system does not know the 
hypotheses that the other system has. So the two systems share only the 
experimental space (database) and do not share the hypothesis space. Moreover, 
this collaboration needs neither additional working memory capacity nor new 
production rules. The two systems only exchange their experimental results. 

Our question is whether or not emergence occurs in this type of collaboration. 
If we observe emergent phenomenon in this collaboration, we find an important 
function of collaborative problem solving because we can obtain some benefits 
from interaction without any additional cognitive abilities. 

2 2-4-6 Task and Hypothesis Testing Strategy 

As we mentioned before, in the empirical studies of collaborative problem 
solving, many psychological experiments have been conducted using relatively 
simple (puzzle-like) tasks [7]. Wason’s 2-4-6 task is one of the most popular 
tasks among them. This task has been widely and continuously used in 
psychological studies of hypothesis testing strategies. In this study, we also 
utilize the 2-4-6 task. 

The standard procedure of the 2-4-6 task is as follows. Subjects are required 
to find a rule of relationship among three numerals. In the most popular situa- 
tion, a set of three numerals, [2, 4, 6], is presented to subjects at the initial stage. 
The subjects form hypotheses about the regularity of the numerals based on the 
presented set. Three continuous evens, three evens, and ascending numerals are 
examples of hypotheses. 

Subjects conduct experiments by producing a new set of three numerals and 
present them to an experimenter. This set is called an instance. An experimenter 
gives Yes feedback to subjects if the set produced by subjects is an instance of 
the target rule, or No feedback if it is not an instance of the target. Subjects 
carry out continuous experiments, receive feedback from each experiment, and 
search to find the target. 

The most basic distinction in hypothesis testing is between the positive test 
(Ptest) and the negative test (Ntest). Ptest is experimentation using a 
positive instance for a hypothesis, whereas Ntest is experimentation using a 
negative instance. For example, when a subject has the hypothesis that the three 
numerals are evens, an experiment using the instance [2, 8, 18] corresponds to 
Ptest, and an experiment with [1, 2, 3] corresponds to Ntest. Note that the 
positive or negative test is defined with respect to the subject’s hypothesis, 
whereas Yes or No feedback is defined with respect to the target. Figure 2 
indicates the pattern of hypothesis reconstruction based on the combination of 
a hypothesis testing strategy and an experimental result (Yes or No feedback 
from an experimenter). When Ptest is conducted and No feedback is given, the 
hypothesis is disconfirmed. Another case of disconfirmation is the combination 
of Ntest and Yes feedback. On the other hand, the combinations of Ptest with 
Yes feedback and Ntest with No feedback confirm the hypothesis. 
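
The four combinations can be illustrated with a short sketch in which a hypothesis and a target are both represented as Boolean predicates over a triple of numerals. The predicate names, the choice of “ascending numerals” as the target, and the function `classify_experiment` are assumptions made for illustration only.

```python
def three_evens(instance):
    """Hypothesis: all three numerals are even."""
    return all(n % 2 == 0 for n in instance)


def ascending(instance):
    """Target used for this illustration: the numerals are strictly ascending."""
    a, b, c = instance
    return a < b < c


def classify_experiment(hypothesis, target, instance):
    """Return (test type, feedback, effect on the hypothesis) for one experiment."""
    positive = hypothesis(instance)  # the test type depends on the hypothesis
    yes = target(instance)           # the feedback depends on the target
    test = "Ptest" if positive else "Ntest"
    feedback = "Yes" if yes else "No"
    outcome = "confirmation" if positive == yes else "disconfirmation"
    return test, feedback, outcome


for instance in ([2, 8, 18], [1, 2, 3], [6, 4, 2], [3, 1, 5]):
    print(instance, classify_experiment(three_evens, ascending, instance))
```

A hypothesis is confirmed exactly when it and the target agree on the tested instance, which is why Ptest with Yes and Ntest with No both count as confirmation.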




            target
            Yes               No
Ptest       confirmation      disconfirmation
Ntest       disconfirmation   confirmation

Fig. 2. Pattern of hypothesis reconstruction. 

Both philosophical and psychological studies have stressed the importance of 
disconfirmation of hypotheses; disconfirmation is considered an opportunity 
for revising a hypothesis. Excellent hypothesis testing strategies are regarded 
as strategies providing as many opportunities for disconfirmation as possible. 

Task analyses, psychological experiments, and computer simulations give us 
consistent knowledge on good strategies for hypothesis testing. Ptest is 
effective when the target that subjects must find is specific; on the other 
hand, Ntest is effective when the target is general. The degree of specificity 
is defined by the density of target instances among all possible instances. 
For example, “three continuous evens” is more specific than “three evens”, 
which is more specific than “the first numeral is even”. 
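
The ordering of the three example targets can be checked by counting instances. The sketch below assumes, purely for illustration, that the space of possible instances is all ordered triples of integers from 1 to 20; the text does not specify the actual instance space, and the function names are hypothetical.

```python
from itertools import product


def density(rule, low=1, high=20):
    """Fraction of all triples over [low, high] that satisfy the rule."""
    triples = list(product(range(low, high + 1), repeat=3))
    return sum(1 for t in triples if rule(t)) / len(triples)


def continuous_evens(t):
    a, b, c = t
    return a % 2 == 0 and b == a + 2 and c == b + 2


def three_evens(t):
    return all(n % 2 == 0 for n in t)


def first_even(t):
    return t[0] % 2 == 0


for rule in (continuous_evens, three_evens, first_even):
    print(rule.__name__, density(rule))
```

The smaller the density, the more specific the target, so under this assumed instance space the three rules are ordered as stated in the text.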

3 Interactive Production System 

We have developed an interactive production system architecture for simulating 
collaborative discovery processes (see Figure 3). 

The architecture consists of five parts: the production sets of System A, the 
production sets of System B, the working memory of System A, the working memory 
of System B, and a common shared blackboard. The two systems interact through 
the common blackboard. That is, each system writes elements of its working 
memory on the blackboard, and the other system can read them from the 
blackboard. 

The systems have knowledge on the regularities of three numerals. The 
knowledge is organized as dimension-value lists. For example, “continuous 
evens”, “three evens”, and “the first numeral is even” are example values of 
the dimension “Even-Odd”. The dimensions the systems use are: Even-Odd, Order, 
Interval, Range of digits, Certain digit, Mathematical relationship, Multiples, 
Divisors, Sum, Product, and Different. The dimension-value lists organize the 
systems’ hypothesis space. 

Fig. 3. Basic architecture of interactive production system. Each of System A 
and System B has productions (knowledge and strategies) and a working memory 
(hypothesis and experimental results) driven by a cognitive cycle; the two 
systems are connected through the common blackboard (hypothesis and 
experimental results). 



Basically the systems search the hypothesis space randomly; however, three 
hypotheses, “three continuous evens”, “the interval is 2”, and “three evens”, 
are treated specially. Human subjects tend to generate these hypotheses first 
when the initial instance, “2, 4, 6”, is presented. So our systems also 
generate these hypotheses prior to other possible hypotheses. 

In the computer simulations, two factors are controlled. In Section 4, 
hypothesis testing strategies are controlled. In Section 5, in addition to 
hypothesis testing strategies, hypothesis formation strategies are also 
manipulated. 

4 Collaboration Using Different Hypothesis Testing 
Strategies 

First, we discuss collaboration of two systems, each of which has a different 
hypothesis testing strategy. 

Table 1 shows an example result of the computer simulations. The target was 
“Divisor of three numerals is 12” . 



Table 1. An example of computer simulations. 

Hypotheses by System A          Experiments                         Hypotheses by System B
                                1   2, 4, 6     Yes
2   Continuous even numbers. 0  3   4, 6, 8     No   Ptest by SysA  -
4   The product is 48.       0  6   6, 6, -17   No   Ntest by SysB  5   The sum is a multiple of 4.  0
8   The product is 48.       1  9   24, -1, -2  No   Ptest by SysA  7   The sum is a multiple of 4.  1
10  First + Second = Third.  0  12  3, -8, -20  No   Ntest by SysB  11  The sum is a multiple of 4.  2
14

Two systems interactively found the target. One system, System A, always 
used Ptest in its experiments, and the other, System B, used Ntest. The table 
principally consists of three columns. The left-most and right-most columns 
indicate hypotheses formed by System A and System B, respectively. The middle 
column indicates experiments, that is, generated instances, Yes or No feedback, 
and the distinction of Ptest or Ntest conducted by each system. The experiments 
were conducted alternately by the two systems, and the results of the 
experiments were sent to both systems. The left-most number in each column 
indicates the order of processing, from #1 through #41. The right-most number 
in the left-most and right-most columns indicates the number of times each 
hypothesis has been confirmed. System A disconfirmed its hypotheses at #4, #10, 
and #16, as a result of self-conducted experiments at #3, #9, and #15. System B 
disconfirmed its hypotheses at #17 and #29, as a result of other-conducted 
experiments at #15 and #27. 

In the following computer simulations, we let two systems find 35 kinds of 
targets. Examples of the targets are: three continuous evens, ascending numbers, 
the interval is 2, single digits, the second numeral is 4, first numeral times second 
numeral minus 2 equals third numeral, multiples of 2, divisors of 24, the sum is 
a multiple of 12, the product is 48, three different numbers. For each target, we 
executed 30 simulations to calculate the percentage of correct solutions. Each 
system terminates the search when it passes four consecutive experiments, that 
is, when confirmation occurs four times in a row. The last hypothesis each 
system formed was compared with the target, and success or failure (a correct or 
incorrect solution) was checked. 

We compared the average performance for finding the 35 targets in the 
following four situations. (A) A single system, System A, finds the targets. 
(B) A single system, System B, finds the targets. (C) Dual systems, System A 
and System B, find the targets independently. (D) Dual systems, System A and 
System B, find the targets interactively. In the following figures, we indicate 
these four situations as SysA, SysB, SysA + SysB, and SysA <=> SysB, 
respectively. 

In the computer simulations, we also consider the combinations of hypothesis 
testing strategies that two systems use. We utilize three kinds of strategies: Ptest, 
Ntest, and Mtest strategies. The Ptest strategy always uses Ptest in experiments, 
the Ntest strategy always uses Ntest, and the Mtest (Medium test) strategy 
uses Ptest in the half of experiments and uses Ntest in the other half. The 
combinations of these three strategies are 6 cases ((a) Ptest vs. Ptest, (b) Ntest 
vs. Ntest, (c) Mtest vs. Mtest, (d) Ptest vs. Mtest, (e) Ntest vs. Mtest, and (f) 
Ptest vs. Ntest). 

Figure 4 shows the results of computer simulations in the 6 cases. In the 
figure, predominance of the independent situations is indicated as “*” (p < 
.05) and “**” (p < .01). “n.s.” indicates no significant difference. On the 
other hand, in the following figures, predominance of the interactive 
situations will be indicated as “#” (p < .05) and “##” (p < .01) (see, for 
example, Figure 6). The results are summarized as follows. 

Fig. 4. Combination of hypothesis testing strategy and the average ratio of 
correct solution. 



— The performance of dual systems was better than that of a single system. 

— The performance of two interacting systems was not so different from the 
performance of two independent systems. Rather independent solutions were 
often better than interactive solutions. So we could not find emergent phe- 
nomenon. 

Next, we discuss the relation between the degree of difference between the 
strategies of the two systems and their total performance. 

We controlled each system’s ratio of conducting Ptest in experiments while 
keeping the total ratio of Ptest at 50%. That is, we set up the following 5 
cases, in which the ratios of Ptest were: (a) 50% (System A) and 50% (System 
B), (b) 38% and 62%, (c) 25% and 75%, (d) 13% and 87%, and (e) 0% and 100%. In 
the 50% and 50% situation, both systems use the Mtest strategy, so there is no 
difference between the strategies. On the other hand, in the 0% and 100% 
situation, one system uses the Ntest strategy and the other uses the Ptest 
strategy, so the difference between the strategies is at its maximum. 

Figure 5 shows the performance (the average percentage of correct solutions 
in finding the 35 targets), and Figure 6 shows the number of disconfirmations 
and the number of formed hypotheses, comparing interactively and independently 
solving situations. The results are as follows. 

— In the independent situations, the performance was almost constant regardless 
of the degree of difference between the strategies, and the number of 
disconfirmations and the number of formed hypotheses decreased as the 
difference between the strategies increased. 

— On the other hand, in the interactive situations, as the difference between 
the strategies increased, the total performance improved, and the number of 
disconfirmations and the number of formed hypotheses increased. 

Fig. 5. The different degree of hypothesis testing strategy of the two systems 
and the ratio of correct solution. 

Fig. 6. The different degree of hypothesis testing strategy and the number of 
disconfirmation, the number of formed hypotheses. 



— The two experimental results above indicate that making two collaborative 
systems use different strategies brought the benefits of interaction. However, 
even when the difference between the strategies was at its maximum, we could 
not confirm the superiority of interacting situations over independent 
situations in terms of performance. 

To investigate the occurrence pattern of disconfirmation, we divided the 35 
targets into 18 specific targets and 17 general targets. In the interactively 
solving situations, when each system used the Mtest strategy (the 50% and 50% 
situation), the occurrence patterns of disconfirmation of the two systems were 
almost the same because both systems used the same strategy (see Figure 7(b)). 
However, in the 0% and 100% situation, there was a big difference (see Figure 
7(a)). 




Fig. 7. The number of disconfirmation in interactive situation. Panels: (a) 
Ptest (sysA) vs. Ntest (sysB); (b) Mtest (sysA) vs. Mtest (sysB). Each panel 
shows sysA and sysB for specific and general targets. 

When the problem was to find specific targets, System B, which used the Ntest 
strategy, disconfirmed its hypotheses more often by other-conducted experiments 
than by self-conducted experiments; on the other hand, System A, which used the 
Ptest strategy, disconfirmed its hypotheses more often by self-conducted 
experiments. When finding general targets, this tendency was reversed. 

In Section 2, we pointed out that the Ptest strategy promotes disconfirmation 
when finding specific targets, whereas Ntest promotes it when finding general 
targets. So the results above show: 

— When two systems use different strategies, a system that uses a 
disadvantageous strategy disconfirms its hypotheses by experiments conducted by 
the other system that uses an advantageous strategy. This complementary 
interaction produces additional abilities for disconfirming hypotheses and 
brings about the superiority of two systems using different strategies. 



5 Collaboration Using Different Hypothesis Testing 
Strategies and Different Hypothesis Formation 
Strategies 

In the previous section, we indicated that the benefits of interaction 
increased as the difference between the hypothesis testing strategies became 
larger. However, the emergent phenomenon was not distinct enough. 

The next question is as follows: if we combine larger differences in other 
problem solving strategies with hypothesis testing strategies, do we find the 
emergent phenomenon? And if we find it, can we confirm in those cases the 
complementary effects of interaction indicated in the previous section? So we 
conducted additional computer simulations, controlling hypothesis formation 
strategies together with hypothesis testing strategies. 

We controlled three kinds of hypothesis formation strategies ((a) Gform: gen- 
eral hypothesis generation has priority, (b) Sform: specific hypothesis generation 
has priority, and (c) Rform: hypotheses are randomly formed) in addition to the 
three hypothesis testing strategies (Ptest, Ntest, and Mtest). The combinations 
of these strategies are 45 cases shown in Table 2. 
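
The counts of 6 cases (three testing strategies) and 45 cases (testing strategies combined with formation strategies) both follow from counting unordered pairs in which the two systems may use the same strategy. The short check below uses `itertools.combinations_with_replacement`; the combined labels such as “Ptest-Gform” are built only for this illustration.

```python
from itertools import combinations_with_replacement, product

testing = ["Ptest", "Ntest", "Mtest"]
formation = ["Gform", "Sform", "Rform"]

# Pairs of hypothesis testing strategies for the two systems (order ignored).
testing_pairs = list(combinations_with_replacement(testing, 2))

# Each system now has one of 3 x 3 = 9 combined strategies.
combined = [f"{t}-{f}" for t, f in product(testing, formation)]
combined_pairs = list(combinations_with_replacement(combined, 2))

print(len(testing_pairs))   # pairs of testing strategies
print(len(combined_pairs))  # pairs of combined strategies
```

With n strategies per system there are n(n + 1)/2 such pairs, which gives 6 for n = 3 and 45 for n = 9.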

Table 2 shows the results of computer simulations. The superiority of 
interacting situations was confirmed in only four cases among the total of 45 
combinations (p < .01). We recognize emergence in these cases. 



Table 2. Comparison of the ratio of correct solution and the combination of 
hypothesis testing strategy and hypothesis formation strategy. 

In Table 2, two pairs of adjoining combinations (Ntest-Gform vs. Ntest-Sform
and Ntest-Gform vs. Ntest-Rform; Ntest-Gform vs. Ptest-Sform and Ntest-Gform
vs. Ptest-Rform) showed similar tendencies in the following analysis, so we
present two representative cases: Ntest-Gform vs. Ntest-Sform and Ntest-Gform
vs. Ptest-Sform. Figure 8 shows the occurrence patterns of disconfirmation in
the two cases, in the same format as Figure 7. We basically confirmed the
complementary effects of interaction mentioned in Section 4, the only exception
being that we could not verify the predominance of disconfirmation by
other-conducted experiments when finding general targets in Figure 8(b).

Fig. 8. Comparison of the number of disconfirmations in interactive and
independent situations.



6 Conclusion 

We have empirically examined whether interaction between two production
systems produces emergent phenomena when the two systems share only the
experimental space. The results are summarized as follows.

— Generally speaking, collaboration neither provides the benefits of
interaction nor causes emergent phenomena when only the experimental space is
shared.

— As the difference between the strategies grows, the benefits of interaction
increase. The benefits come from complementary effects of interaction; that is,
the disadvantage of one system that uses an ineffective strategy is compensated
by the other system that uses an advantageous strategy.

— In the few cases where we recognized emergence, the complementary interaction
of the two systems actually brought a supplementary ability of disconfirmation.
This indicates the possibility of gaining additional benefits through
interaction of two systems without additional cognitive abilities.

For good collaboration, each party should receive effective information from the
other. Intuitively, to do so one should neither ignore information from the
other nor weight one's own information differently from information brought by
the other.

Klayman & Ha pointed out, based on their sophisticated analyses, that the
primary factors for acquiring effective information from the external
environment are constrained by the relation between a target and a hypothesis 0.
From this viewpoint they discussed the importance of hypothesis testing
strategies. Our computer simulations also support this point. Even if a system
takes care not to ignore information brought by the other system, the
interaction does not bring emergence unless the combination of strategies of the
two systems constructs a complementary interacting situation. The key factor in
receiving effective information from the other system is not a cognitive bias
towards the other system, but the relation between the strategies both systems
use. The possibility of producing emergence remains only if this relation
constructs a particular situation in which each system can get mutually
effective information from the other.


Abstract. Most Internet users use search engines to get information
on the WWW. However, users are often not content with the output of
search engines because it is hard to express the user's own interest as a
search query. Therefore, we propose a system which provides users with
keywords related to user interest. With these keywords, users can recall
keywords that express their hidden interest (not described in the search
query) and use them as search keywords. Previous refinements are simple
modifications of the search queries, for example, adding narrow keywords
or relational keywords. However, the restructuring described in this paper
means replacement of the search keywords to express user interest concretely.



1 Introduction: WWW and Information Retrieval 

When Internet users search for information, most use a search engine to find
Web pages. With search engines, it is difficult to get pages related to one's
interests at once, for the following reasons:

1. The search keywords are incomplete because of the user's lack of knowledge
or a lapse of memory.

2. The search keywords are too ambiguous to specify user interest, because
users have tacit interests.

3. A word has various meanings.

4. A topic has various expressions.

In other words, a user does not fully know either his/her own interest or the
pages that exist on the WWW. There is a gap between the user's knowledge and
Web pages. (Problems 3 and 4 are called the vocabulary gap [Furnas87].) To
overcome these problems, some related works provide a user with keywords.

The system [Thesaurus-step2] composes search queries by using a thesaurus.
However, synonyms aren't always used in Web pages because the thesaurus is
made separately from Web pages. The words used in real Web pages are more
useful than other related keywords.

The search engine Rcaau-Mondou [Mondou] provides a user with narrow keywords
extracted from Web pages matched with the user's original search query.
However, unless a user inputs proper search keywords initially, effective
keywords are not supplied and users can't narrow information. Moreover,
keywords expressing user interest also occur in other Web pages.

The search engine AltaVista [AltaVista] provides a user with noun phrases
including search keywords. However, related words don't always occur as a
phrase. The search engine Excite [Excite] provides a user with keywords related
to each search keyword. However, related keywords are more useful for expanding
the range of a topic than for narrowing information. In fact, both engines use
search keywords as OR conditions. Therefore, the work of narrowing information
depends on the search engine, and users can't get the Web pages they really
want.

SSIE [Sunayama] provides both narrow keywords and related keywords.
Related keywords are extracted from Web pages that include the search keywords,
and are words that tend to occur simultaneously in many Web pages. However, a
user's interest doesn't always match the public interest. Namely, a user's
interest lies in every word of a search query. Therefore, the system described
in this paper obtains relational keywords related to each partial interest
(sub-query).

Previous refinements are simple modifications of the search queries, for exam- 
ple, adding narrow or relational keywords. However, the restructuring described 
in this paper means replacement of the search keywords to express user interest 
concretely. This restructuring is due to the Boolean search expression. The
logical operations AND and OR give structure to the search keywords, and with
this system the user composes the optimal structure of the search keywords.

There are other approaches for information retrieval. [Ik riilwichqoj showed a 
method to learn proper words for search by inductive learning from a history of 
a user’s action. Learning user interest as in |t5riiza9dl II llisawaP y| is useful for the 
users who search in a restricted database or set of topics. However, the whole
WWW includes various topics, and user interests are so varied that it is hard to
learn a user's temporary interest from only a few search histories.

In this paper, we propose an Aiding System for Discovery of User Interest,
ASDUI for short. Fig. 1 shows an overview of ASDUI. ASDUI consists of three
parts: the right frame is for extracting interest (relational) keywords, the
central frame is for extracting narrow keywords, and the left frame shows an
interface to restructure the search query.

Finally, the user restructures search queries according to the user's own
interests, because the system cannot infer user interest exactly; in other
words, the user knows his/her own interests better than the system does. This
means an important phase is left to the user, as in the framework of KDD
[Fayyad96]. Therefore, ASDUI should have an interface by which the user can
restructure search queries easily.

In the rest of this paper, we explain ASDUI for user interest expression.
In Section 2 we show how to obtain interest keywords. Section 3 describes the
interface for users to restructure the search query. Section 4 gives some
experimental results with the restructuring of search queries, and Section 5
concludes this paper.

Fig. 1. Aiding System for Discovery of User Interest



2 Interest Discovery System for User Interests Expression 

In this section, we describe the Interest Discovery System for user interest
expression, IDS for short. The right frame in Fig. 1 shows the process of IDS.

A user’s search query is input to IDS, which catches the hidden (not included 
in the search query) user interest. Then, Web pages related to user interest are 
retrieved from the WWW, and appropriate keywords for expressing interest 
are supplied to the user. Namely, IDS discovers the user’s hidden interest by 
extracting keywords from real Web pages. 

2.1 Input: Search Query 

A search query here is given as a boolean expression that combines search
keywords with the logical operations “AND” and “OR”. This search query is
input to a search engine. Presently, most search engines treat search keywords
as AND-conditions. However, some concepts (e.g. computer) are expressed in
different words (e.g. “machine” OR “hardware”) depending on the Web page.
Therefore, a boolean approach is more appropriate for expressing user interest
concretely and precisely. Besides, recent search engines support boolean
input. We make one assumption about search queries.

Assumption: A search query is described by AND-conjunctions and
each condition of the AND-conjunction is described by one OR-conjunction.

In other words, this assumption says that users have only one whole interest and
prepare search keywords related to each divided partial interest. A user who
wants to search Web pages has one thing in mind, so the search keywords should
be combined into an AND-conjunction. Along with this, each keyword may be
replaceable with another word, such as a synonym or a more concrete word, so
such a replacement of words is expressed as an OR-conjunction. For example, a
user who wants to go sightseeing in Kansai, a western part of Japan, has two
search keywords: “Kansai” and “sightseeing”, which are combined in an
AND-conjunction. The user thinks of search keywords related to each word, and a
search query like “(Kyoto OR Nara OR Kobe) AND (Sightseeing OR The sight)” is
composed.

2.2 Division of Search Query 

A search query (user input) is divided into partial search queries. Namely, a
search query is divided at the points of the “AND”. For example, the search
query “(Kyoto OR Nara OR Kobe) AND (Sightseeing OR The sight)” is divided into
two partial queries, “(Kyoto OR Nara OR Kobe)” and “(Sightseeing OR The sight)”.
Each partial query represents a partial interest of the user, because of the
relationship between the user's mind and the search query described in Section
2.1. That is, when a user has an interest, that interest is divided into some
partial interests. A partial interest is an element of user interest. Since the
system can only see the search query composed by the user, the support system
should capture the user's hidden and deep partial interests, so that user
interest can be expressed finely by words which can be used for the search.
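
The division step can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the query already satisfies the assumption of Section 2.1 (an AND of flat OR-groups, no nested parentheses) and that keywords never contain the words AND or OR themselves. The function name divide_query is hypothetical.

```python
import re


def divide_query(query):
    """Split an AND-of-ORs query into partial queries (lists of keywords)."""
    partial_queries = []
    for part in re.split(r"\s+AND\s+", query.strip()):
        part = part.strip()
        # Drop the parentheses that wrap an OR-group, if present.
        if part.startswith("(") and part.endswith(")"):
            part = part[1:-1]
        keywords = [k.strip() for k in re.split(r"\s+OR\s+", part)]
        partial_queries.append(keywords)
    return partial_queries


if __name__ == "__main__":
    q = "(Kyoto OR Nara OR Kobe) AND (Sightseeing OR The sight)"
    for partial in divide_query(q):
        print(partial)
```

Each returned list corresponds to one partial interest, so the example query yields two lists, one for the place names and one for the sightseeing terms.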

2.3 Acquisition of Web Pages from a Partial Interest 

To get keywords for search, our system obtains Web pages related to the user's
partial interests. We apply the traditional idea of discovery from the
simultaneity of events [Langley87] [Sunayama97]. Since search keywords in a
search query are used simultaneously, all search keywords depend on one another
or on a common user (partial) interest.

The method of getting Web pages is quite simple; IDS calls a search engine. 
A search query is given as an AND-conjunction which includes keywords used 
in a partial user interest. That is, for examining the reason why those search 
keywords were used simultaneously, our system gets the pages including all those 
keywords. For example, when a partial interest is “Sightseeing OR The sight”,
“Sightseeing AND The sight” is composed as a search query and given to the 
search engine. Web pages retrieved for this query can be expected to express the 
content of user interest. 
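
As a rough sketch of this step, the code below turns a partial interest into an AND query and retrieves matching pages from a toy in-memory collection. The real system calls a search engine; the page store, the whitespace-token matching, and the names probe_query and retrieve_pages are assumptions made for illustration. Ranking by the total frequency of the search keywords and the limit of 20 pages follow the description in this section.

```python
def probe_query(partial_interest):
    """Combine the keywords of one partial interest into an AND query."""
    return " AND ".join(partial_interest)


def retrieve_pages(pages, partial_interest, limit=20):
    """Return ids of pages containing every keyword, most frequent first.

    pages maps a page id to its list of words (a stand-in for a search
    engine index; single-token keywords are assumed).
    """
    hits = []
    for page_id, words in pages.items():
        counts = [words.count(k) for k in partial_interest]
        if all(c > 0 for c in counts):
            hits.append((sum(counts), page_id))
    # Highest total keyword frequency first; page id breaks ties.
    hits.sort(key=lambda h: (-h[0], h[1]))
    return [page_id for _, page_id in hits[:limit]]


if __name__ == "__main__":
    pages = {
        "a": ["sightseeing", "sight", "sight"],
        "b": ["sightseeing"],
        "c": ["sight", "sightseeing"],
    }
    print(probe_query(["Sightseeing", "The sight"]))
    print(retrieve_pages(pages, ["sightseeing", "sight"]))
```

Page "b" is dropped because it lacks one of the keywords, which is the point of replacing OR with AND when probing a partial interest.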

Only 20 pages are used to extract keywords. However, these 20 pages are the
ones ranked highest by the frequency of the search keywords, and keywords that
are useful for search but not extracted at this stage will appear in later
searches.

2.4 Keyword Extraction from Web Pages 

The Web pages obtained in Section 2.3 may have been written out of the same
interest as the user's, so such pages may include words expressing or related
to user interest. Therefore, the system is expected to extract interest
keywords from those Web pages.

To put this into practice, we first extract nouns from the Web pages matched
with the search query. The Web pages and keywords we used were only in
Japanese throughout our experiments. We extracted nouns by morphological
analysis, but some improper words like “home” and “page”, which are used
frequently but carry little meaning, are excluded in advance. Our system uses
ChaSen [ChaSen] as the Japanese morphological analysis system. The extracted
nouns, the keyword candidates, are evaluated by the value given by equation (1).
In IDS, the value of word A is given as follows.



\[ \mathrm{value}(A) = \sum_{d \in D_u} \log\bigl(1 + tf(A, d)\bigr) \qquad (1) \]

D_u is the document set retrieved in Section 2.3 and tf(A, d) is the frequency
of word A in document d. This value will be large if a word frequently appears
in the set of all documents obtained by a search query. This evaluation function
is different from ordinary functions like tfidf [Salton97]. Since we want
keywords which are prevalent in the area of user interest, i.e. the area
represented by the documents D_u, keywords commonly used are better than
keywords used in only a few documents in D_u. As a result, these interest
keywords represent the user's partial interests. Finally, the set of interest
keywords common to all partial interests becomes the user's whole interest.
Furthermore, to get more related keywords, the evaluation of the keywords is
revised as in the following steps.

1. Select 100 nouns for each partial interest by function (1).

2. If a noun A is included in multiple partial interests, value(A) is revised
to value(A) * number(A), where number(A) is the number of partial interests
including the word A.

3. Select 20 nouns by the revised values of nouns. 

With this revision, keywords related to only one partial interest are outranked
by keywords related to several partial interests, so that the keywords related
to the whole user interest are selected.
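
The scoring and revision steps can be sketched as follows. Two points are assumptions rather than statements from the text: Eq. (1) is read as a sum of log(1 + tf(A, d)) over the documents in D_u, and when a noun is selected for several partial interests its largest per-interest value is the one multiplied by number(A). Documents are represented as lists of already extracted nouns, and the function names are hypothetical.

```python
import math
from collections import Counter


def value(word, documents):
    """Eq. (1), read as a sum over the documents retrieved for one interest."""
    return sum(math.log(1 + doc.count(word)) for doc in documents)


def interest_keywords(partial_docs, per_interest=100, final=20):
    """Select interest keywords from one document set per partial interest."""
    selected = []  # one {noun: value} dict per partial interest
    for documents in partial_docs:
        vocabulary = {w for doc in documents for w in doc}
        scores = {w: value(w, documents) for w in vocabulary}
        # Step 1: keep the best-scoring nouns of this partial interest.
        top = sorted(scores, key=lambda w: (-scores[w], w))[:per_interest]
        selected.append({w: scores[w] for w in top})

    # Step 2: multiply by the number of partial interests containing the noun.
    number = Counter(w for sel in selected for w in sel)
    revised = {}
    for sel in selected:
        for w, v in sel.items():
            # Assumption: use the largest per-interest value for shared nouns.
            revised[w] = max(revised.get(w, 0.0), v * number[w])

    # Step 3: keep the best nouns by revised value.
    return sorted(revised, key=lambda w: (-revised[w], w))[:final]


if __name__ == "__main__":
    docs_hirosue = [["ryoko", "waseda", "fan"], ["ryoko", "exam"]]
    docs_university = [["waseda", "exam", "campus"], ["exam"]]
    print(interest_keywords([docs_hirosue, docs_university], final=3))
```

In the toy data, "exam" appears in the documents of both partial interests, so its value is doubled and it ranks first, which mirrors the intended preference for keywords shared across partial interests.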

3 An Interface for Restructuring the Search Query

In this section, we describe our interface to restructure the search query by using 
keywords expressing user interest. 



3.1 Keywords to Support Searching 

In ASDUI, two kinds of keywords are supplied to a user:

1. Narrow keywords: Keywords extracted from the Web pages matched with
the user's original search query.

2. Interest keywords: Keywords extracted from the Web pages matched with the
search query prepared by the user interest hypothesis (see Section 2.4).

Narrow keywords are extracted by the same evaluation function as in Eq. (1).
Namely, a value is given to every noun in the Web pages matched with the user's
original search query. Then, high scoring keywords are extracted as narrow
keywords. These narrow and interest keywords are then supplied to the user. The
numbers of narrow keywords and interest keywords are both fixed to 20 in the
current system.^1

3.2 Restructuring of Search Queries for Improving Results 

Before we show the interface of ASDUI, we explain how to restructure search
queries. There are two traditional criteria, precision and recall, to measure
search performance. Precision is the fraction of output pages that match the
user's request, and recall is the fraction of pages the user wanted that appear
in the output. These criteria are described in the following equations using
the symbols A, the set of pages a user wants to get, and B, the set of pages
retrieved by the search engine.



\[ \mathrm{Precision} = \frac{|A \cap B|}{|B|} \qquad (2) \]

\[ \mathrm{Recall} = \frac{|A \cap B|}{|A|} \qquad (3) \]



Here, |S| denotes the number of elements in set S. The following patterns of
restructuring the search query improve precision and recall.



— Improve the precision

1. Add a narrow keyword to the search query as a new AND condition

2. Delete a word that was combined into the search query as an OR condition

3. Add a narrow keyword as a NOT condition^2

— Improve the recall

4. Add an interest keyword to the search query as an OR condition

5. Delete a word that was combined into the search query as an AND condition

— Improve both the precision and the recall

6. Replace a word with multiple interest/narrow keywords

7. Replace words with a single interest/narrow keyword

The user decides which restructuring to apply and which keyword to use,
according to the user's own interest not described in the initial search query,
because the system does not have a precise idea about the content of user
interest. Since the user has many tacit interests in mind, user interest should
be made concrete gradually with the help of the supplied keywords.
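
Precision and recall as defined in Eqs. (2) and (3) are straightforward to compute on sets of page identifiers. The sketch below uses made-up page ids to show how adding an AND condition (restructuring 1) can raise precision; the data and function names are illustrative assumptions, not results from the paper.

```python
def precision(wanted, retrieved):
    """|A ∩ B| / |B|, with A = wanted pages and B = retrieved pages."""
    return len(wanted & retrieved) / len(retrieved) if retrieved else 0.0


def recall(wanted, retrieved):
    """|A ∩ B| / |A|, with A = wanted pages and B = retrieved pages."""
    return len(wanted & retrieved) / len(wanted) if wanted else 0.0


if __name__ == "__main__":
    wanted = {1, 2, 3, 4}                # A: pages the user wants
    before = {3, 4, 5, 6, 7, 8}          # B: output of the original query
    after_and = {3, 4, 5}                # B after adding an AND condition
    print(precision(wanted, before), recall(wanted, before))
    print(precision(wanted, after_and), recall(wanted, after_and))
```

In this toy case the narrower query keeps both relevant pages while discarding three irrelevant ones, so precision rises and recall is unchanged; in general an added AND condition may also lower recall.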

^1 The number of keywords of each kind, 20, was chosen from experimental results
so that keywords are easy to choose and related keywords appear in moderate
numbers.

^2 A NOT condition excludes pages that include the NOT word.






Wataru Sunayama, Yukio Ohsawa, and Masahiko Yachida 



3.3 The Interface Overview of ASDUI 

Fig. 2 shows the interface to restructure a search query. The two square frames
in Fig. 2 each depict one of the user's partial interests, and each search
keyword entered by the user is in one frame. The keywords marked with “*” are
narrow keywords, and the keywords under the frames are interest keywords. The
interest keywords newly obtained by ASDUI for each partial interest are
arranged right under each frame, and keywords included in the conjunction of
the partial interests are placed further down. In this interface, a user can
restructure the search queries by dragging a keyword with the mouse. If a user
pushes the “FRAME” button in the upper-right, one frame is added. The frames
are combined by AND conditions, and the keywords in a frame are combined by OR
conditions to express user interest. The frame in the lower-left is for NOT
conditions: keywords included in this frame are treated as NOT conditions.
Then, the search query displayed at the top of the interface is given to the
component search engine by pushing the “SEARCH” button.
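
The mapping from frames to a search query can be sketched as below. The data model (a list of keyword lists plus one NOT frame), the textual form of the NOT condition, and the function names are assumptions for illustration; the paper describes only the interface behaviour.

```python
def frames_to_query(frames, not_frame=()):
    """Frames are ANDed, keywords inside a frame are ORed, NOT words appended."""
    parts = []
    for frame in frames:
        if not frame:
            continue  # an empty frame contributes no condition
        if len(frame) == 1:
            parts.append(frame[0])
        else:
            parts.append("(" + " OR ".join(frame) + ")")
    query = " AND ".join(parts)
    for word in not_frame:
        query += f" NOT {word}"  # hypothetical NOT syntax
    return query


def matches(page_keywords, frames, not_frame=()):
    """True if a page satisfies every frame and contains no NOT word."""
    words = set(page_keywords)
    in_every_frame = all(any(k in words for k in f) for f in frames if f)
    return in_every_frame and not any(k in words for k in not_frame)


if __name__ == "__main__":
    frames = [["Hirosue", "Ryoko"], ["Waseda Univ.", "Waseda"],
              ["Entrance ceremony"]]
    print(frames_to_query(frames))
    print(matches(["Ryoko", "Waseda", "Entrance ceremony"], frames))
```

Dragging a keyword between frames then amounts to moving a string between lists, after which the query string is regenerated.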

4 Experiments: Discovering Interests and Supplying 
Keywords 

We conducted some experiments to evaluate our support system. Our system
is implemented on the WWW using C, Perl, and Java. The server machine is a
DOS/V PC (Pentium II 350 MHz CPU, 128 MB memory, Linux). The database includes
8000 Web pages about public entertainment and sports in 1997-1999, downloaded
from the WWW ([zakzak] and other sites). Note that these experiments were
conducted with Japanese students in Japanese, so the keywords shown in this
section are translated from Japanese to English.

4.1 Interest Interpretation by the Interface 

One user wanted to know about Ryoko Hirosue, and he/she (the experiment was
conducted blindly) entered the search query “Hirosue^3 AND University”. For
this search query, 27 pages were retrieved and a restructuring interface like
Fig. 2 was output.

In this interface, “Hirosue” and “University” each occupy one frame as a partial
user interest. Under each frame, keywords related to that partial interest are
displayed. For example, under the frame “Hirosue”, “Ryoko”, which is her first
name, “Waseda Univ.”, which is her school, and other interest keywords appear.
Under the other frame, “University”, only keywords related to universities
appear.

Now consider the four keywords at the bottom of the interface in Fig. 2. These
four keywords were included in the interest keywords of both partial interests.



^3 Ryoko Hirosue is a top Japanese star, a singer and an actress. In fact, the
number of Web pages including her name is the greatest of all the stars in
Japan. Ryoko entered Waseda University this past April.

Fig. 2. Restructuring interface 1



Therefore, these four keywords are interpreted as the user's whole interest
expressed by the input search query. Along with this, the supplied keywords
represent the database, so these keywords can fill the gap between the user's
knowledge and the database. These keywords can be interpreted as follows: Ryoko
Hirosue, who was a high school student, has passed the university entrance
exam, and some comments by the people concerned exist.



4.2 The Role of the Keywords 

Since user interest can be interpreted in this way, an interface like Fig. 2
has the advantage that the role of each keyword for the user becomes explicit,
i.e. some keywords represent a partial interest and some keywords represent the
whole interest.

In the interface, narrow keywords appear above the frames, among the partial
interests, and in the whole interest. Therefore, a user can narrow information
in the direction the user wants. Besides, a user can see what pages are hit by
a search query. In this case, no narrow keyword is among the interest keywords
of “University”, so only pages about Hirosue were output.

Finally, interest keywords are useful for expanding user interest. The interest
keywords include synonyms of the keywords in each frame. Such interest keywords
are preferably found in the pages hit by the search query for a partial
interest. For these reasons, keywords that are both interest keywords and
narrow keywords are more useful for search.

4.3 The Restructuring of a Search Query 

The user of Fig. 2 can make his/her interest concrete by using the displayed
keywords. First, the user restructured the search query by adding the keyword
“Ryoko” to the frame “Hirosue” (restructuring 4 in Section 3.2). This means
that the user can get pages using not only “Hirosue” but also “Ryoko”, because
these keywords are combined by an OR condition. Then, the user used the words
“Waseda” and “Waseda Univ.” in place of “University” (restructuring 6). That
is, the user found the name of the university and expressed the interest
concretely. Finally, the user added “Entrance ceremony” as an AND condition, as
in Fig. 3 (restructuring 1). In this way the user made his/her interest
concrete by using the supplied keywords.

The operations to restructure the search query are as follows. First, drag the
keywords “Ryoko”, “Waseda”, and “Waseda Univ.” to the proper frames and drag
“University” out. Second, push the “FRAME” button to create a new frame, and
drag the keyword “Entrance ceremony” to it with the mouse. Thus, a new search
query is obtained and displayed at the top of the interface. Finally, the user
pushes the “SEARCH” button to search again with the new search query. After the
second search, the user got 10 pages and easily found the news about Ryoko's
entrance ceremony.



[Figure: screenshot of the restructuring interface. Query line: “(Hirosue OR
Ryoko) AND (Waseda Univ. OR Waseda) AND Entrance ...”, followed by the SEARCH
and FRAME buttons. Keywords shown include *Department of education, University,
*Fan, *Take an exam, *Ryoko, Entrance exam, *Entrance into a school, Graduation,
and *Entrance ceremony.]



Fig. 3. Restructuring interface 2 

4.4 Recalling New User Interest

The interface after the second search is as Fig^ After the satisfaction of the 
first search, new user interest may be recalled by the supplied keywords. For 
example, one user may want to know the news when she passed the exam, and 
another user may want to know whether she will attend her classes or not. 

Then, the user wanted other information outside of her campus life. So,
the user excluded the keywords related to university by dragging the keywords
into the NOT-frame. The keywords excluded were “Waseda Univ.”, “Waseda”,
“Waseda University”, “University”, “Department”, “Department of Education”,
“Entrance ceremony”, and “Campus” (restructuring 3 in Section 3.2).




Fig. 4. Restructuring interface 3 

The search query became “Hirosue OR Ryoko” with 8 NOT conditions. After the
search, 129 pages were output (156 pages without NOT conditions). Using the
newly supplied keywords, “Kanako”, “Enomoto”, “Okina”, “Picture”, and “Link”
were also excluded as NOT conditions (the first three keywords are the names of
other idols). In this interface, NOT conditions work very well, because
keywords are extracted from many pages by function (1). Therefore, users can
get rid of irrelevant pages by using the displayed keywords, which represent a
subset of the database.

After this search with 13 NOT conditions, 70 pages were output and the re- 
structuring interface supplied the keywords, “Popularity”, “Drama”, “Appear- 
ance”, “CM”, “Idol”, “Sale”, “Hit”, “Debut”, “Actress”, “Fuji TV”, “Single” 
and “Okamoto” . Actually, she appeared in a Fuji TV drama. Her debut single 
was a hit and her second single was produced by M. Okamoto. Now that the user
has obtained useful keywords, the NOT conditions are no longer needed, and the
user can get more pages about these topics. The user may make new search
queries such as “(Ryoko OR Hirosue) AND Drama AND Appearance”.

4.5 Features and Evaluations 

From the examples described above, ASDUI can be evaluated as follows:

— Supply of keywords 

• ASDUI fills the gap between user’s interest and Web pages 

• Users can be reminded of keywords 

• Users can recall other related topics 

• Users can remove irrelevant pages 

— Interest keywords 

• Users can expand the range of information 

• Users can get keywords not depending on the number of hit pages 

• Narrow keywords can be classified by partial user interest 

— Interface 

• Users can restructure the search query 

• Keywords are classified by each role 

• User interest is expressed by the keywords in the pages. 

In order to verify these features, we collected questionnaires about the
system. The results were provided by 16 people who are accustomed to using
search engines, and 72 search queries were input. For 71% of the search queries,
users were able to discover, when shown the results of the system, keywords
expressing their interest which they could not enter at first. 63% of the
people reported that restructuring keywords is easier with this interface than
with simple search engines. That is, the related keywords arranged in the
interface help users in finding/recalling appropriate keywords. Finally, for
68% of the search queries, new interests were created in the user's mind by
looking at the supplied keywords.

Finally, although our database was much smaller than the real WWW, users 
can get interesting pages by using a NOT condition. After repetitions of restruc- 
turing, keywords which a user wants will appear. 

5 Conclusion

In this paper, we proposed an aiding system for discovery of user interest. This
system supplies keywords representing user interest, and the user can
restructure the search query in a two dimensional interface by using the
supplied keywords. In practice, the more varied people's interests become, the
more difficult it will be for them to express their interests in words.
Therefore, a support system such as ASDUI will be more effective in the future.


Abstract. Naive Bayes is a well known and well studied algorithm both
in statistics and machine learning. Bayesian learning algorithms represent
each concept with a single probabilistic summary. In this paper we
present an iterative approach to naive Bayes. The iterative Bayes begins
with the distribution tables built by the naive Bayes. Those tables are
iteratively updated in order to improve the probability class distribution
associated with each training example. Experimental evaluation of Iterative
Bayes on 25 benchmark datasets shows consistent gains in accuracy.
An interesting side effect of our algorithm is that it proves to be robust
to attribute dependencies.



1 Introduction 



Pattern recognition literature |Ej and machine learning p2| present several
approaches to the learning problem, most of them in a probabilistic setting.
Suppose that P(C_i|x) denotes the probability that example x belongs to class i.
The zero-one loss is minimized if, and only if, x is assigned to the class C_k
for which P(C_k|x) is maximum p|. Formally, the class attached to example x is
given by the expression:

\[ \arg\max_i P(C_i|x) \qquad (1) \]

Any function that computes the conditional probabilities P(C_i|x) is referred to
as a discriminant function. Given an example x, Bayes' theorem provides a
method to compute P(C_i|x):



\[ P(C_i|x) = \frac{P(C_i)\,P(x|C_i)}{P(x)} \qquad (2) \]



P(x) can be ignored, since it is the same for all the classes, and does not
affect the relative values of their probabilities. Although this rule is optimal,
its applicability is reduced due to the large number of conditional probabilities
required to compute P(x|C_i). To overcome this problem several assumptions are
usually made.

Depending on the assumptions made we get different discriminant functions 
leading to different classifiers. In this work we study one type of discriminant 
functions, that leads to the naive Bayes classifier. 

1.1 Naive Bayes Classifier

Assuming that the attributes are independent given the class, P(x|C_i) can be
decomposed into the product P(x_1|C_i) * ... * P(x_n|C_i). Then, the probability
that an example belongs to class i is given by:

\[ P(C_i|x) = \frac{P(C_i)}{P(x)} \prod_{j=1}^{n} P(x_j|C_i) \qquad (3) \]

The classifier obtained by using the discriminant function (3) and the decision
rule (1) is known as the naive Bayes classifier. The term naive comes from the
assumption that the attributes are independent given the class.

Implementation Details. All the required probabilities are computed from
the training data. To compute the prior probability of observing class i, P(C_i),
a counter for each class is required. To compute the conditional probability of
observing a particular attribute-value given that the example belongs to class
i, P(x_j|C_i), we need to distinguish between nominal attributes and continuous
ones. In the case of nominal attributes, the set of possible values is an
enumerable set. To compute the conditional probability we only need to maintain
a counter for each attribute-value and for each class. In the case of continuous
attributes, the number of possible values is infinite. There are two
possibilities. We can assume a particular distribution for the values of the
attribute; usually the normal distribution is assumed. As an alternative, we
can discretize the attribute in a pre-processing phase. The former has been
shown to yield worse results than the latter BQ. Several methods for
discretization appear in the literature; a good discussion about discretization
is presented in BJ. In 0 the number of intervals is fixed to
k = min(10, nr. of different values) equal width intervals. Once the attribute
has been discretized, a counter for each class and for each interval is used to
compute the conditional probability.
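
A minimal sketch of the equal-width discretization with k = min(10, number of different values) intervals is given below. The handling of a constant attribute and the assignment of the maximum value to the last interval are assumptions of this sketch, and the function name is hypothetical.

```python
def equal_width_bins(values, max_bins=10):
    """Map each continuous value to the index of its equal-width interval."""
    k = min(max_bins, len(set(values)))
    lo, hi = min(values), max(values)
    if k <= 1 or hi == lo:
        return [0] * len(values)  # a constant attribute gets a single interval
    width = (hi - lo) / k
    # The maximum value would fall just past the last interval, so clamp it.
    return [min(int((v - lo) / width), k - 1) for v in values]


if __name__ == "__main__":
    print(equal_width_bins([0.0, 5.0, 10.0]))
    print(equal_width_bins(list(range(20))))
```

After this step every continuous attribute takes at most ten distinct interval indices, so it can be counted in the same way as a nominal attribute.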

All the probabilities required by equation (3) can be computed from the training
set in one pass. The process of building the probabilistic description of the
dataset is very fast. Another interesting aspect of the algorithm is that it is easy
to implement in an incremental fashion because only counters are used.
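A counter-based learner of this kind might look like the sketch below. The class name `CountingNaiveBayes` and its methods are hypothetical, attributes are assumed to be nominal or already discretized, and no smoothing is applied (the text does not describe any).

```python
from collections import defaultdict


class CountingNaiveBayes:
    """Naive Bayes over nominal (or discretized) attributes, built only from counters."""

    def __init__(self, n_attributes):
        self.class_counts = defaultdict(float)
        # One contingency table per attribute: tables[j][(value, cls)] -> count.
        self.tables = [defaultdict(float) for _ in range(n_attributes)]

    def update(self, x, cls):
        """Incremental step: seeing one more example only increments counters."""
        self.class_counts[cls] += 1.0
        for j, value in enumerate(x):
            self.tables[j][(value, cls)] += 1.0

    def posterior(self, x):
        """Return P(C_i | x) for every class, normalised to sum to 1 (equation 3)."""
        total = sum(self.class_counts.values())
        scores = {}
        for cls, n_cls in self.class_counts.items():
            score = n_cls / total                                  # prior P(C_i)
            for j, value in enumerate(x):
                score *= self.tables[j][(value, cls)] / n_cls      # P(x_j | C_i)
            scores[cls] = score
        norm = sum(scores.values())
        # Assumption: fall back to a uniform answer when every score is zero.
        return {c: (s / norm if norm > 0 else 1.0 / len(scores))
                for c, s in scores.items()}


nb = CountingNaiveBayes(n_attributes=2)
for x, c in [(("a", "x"), "P"), (("a", "y"), "P"),
             (("b", "x"), "N"), (("b", "y"), "N"), (("a", "x"), "N")]:
    nb.update(x, c)
print(nb.posterior(("a", "x")))
```

Dividing by the sum of the scores plays the role of the term $P(x)$ in equation (3).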

Analysis of the Algorithm. Domingos and Pazzani [?] show that this procedure
has a surprisingly good performance in a wide variety of domains, including
many where there are clear dependencies between attributes. They argue that
the naive Bayes classifier is still optimal when the independence assumption is
violated, as long as the ranks of the conditional probabilities of classes given an
example are correct.

Some authors [?] report that this classifier is robust to noise and irrelevant
attributes. They also report that the learned theories are easy to understand by
domain experts, mostly due to the fact that naive Bayes summarizes the
variability of the dataset in a single probabilistic description, and assumes that
this is sufficient to distinguish between classes.
Improvements. A few techniques have been developed to improve the performance
of the naive Bayes classifier. Some of them are enumerated:

1. The semi-naive Bayes classifier [?]. It attempts to join pairs of attributes,
making a cross-product attribute, based on statistical tests for independence.
The experimental evaluation was disappointing.

2. The recursive naive Bayes [?]. An algorithm that recursively constructs a
hierarchy of probabilistic concept descriptions. The author concludes that
“the results are mixed and not conclusive, but they are encouraging enough
to recommend closer examination”.

3. The naive Bayes tree [?]. It is a hybrid algorithm. It generates a regular
univariate decision tree, but the leaves contain a naive Bayes classifier built
from the examples that fall at this node. The approach retains the
interpretability of naive Bayes and decision trees, while resulting in classifiers
that frequently outperform both constituents, especially in large datasets.

4. The Flexible Bayes of George John [?] uses, for continuous attributes,
a kernel density estimation (instead of the single Gaussian assumption) but
retains the independence assumption. The estimated density is averaged over
a large set of kernels:

$$P(x \mid C_i) = \frac{1}{n h} \sum_{j=1}^{n} K\!\left(\frac{x - \mu_j}{h}\right)$$

where the $\mu_j$ are the $n$ training values of the attribute observed for the
class, $h$ is the bandwidth parameter and $K$ the kernel shape, $K = g(x, 0, 1)$.
In empirical tests on UCI datasets, flexible Bayes achieves significantly higher
accuracy than simple Bayes on many domains.

5. Webb and Pazzani [?] have also presented an extension to the naive Bayes
classifier. A numeric weight is inferred for each class using a hill-climbing
search. During classification, the naive Bayesian probability of a class is
multiplied by its weight to obtain an adjusted value. The use of this adjusted
value in place of the naive Bayesian probability is shown to significantly
improve predictive accuracy.

2 Iterative Naive Bayes 

The naive Bayes classifier builds for each attribute a two-way contingency table
that reflects the distribution on the training set of the attribute-values over
the classes.

For example, consider the Balance-scale dataset [?]. This is an artificial
problem available at the UCI repository. This data set was generated to model
psychological experimental results. This is a three-class problem, with four
continuous attributes. The attributes are the left weight, the left distance, the
right weight, and the right distance. Each example is classified as having the
balance scale tip to the right, tip to the left, or be balanced. The correct way to
find the class is to take the greater of left_distance × left_weight and
right_distance × right_weight. If they are equal, it is balanced. There is no
noise in the dataset.

Because the attributes are continuous, the discretization procedure of naive
Bayes applies. In this case each attribute is mapped to 5 intervals. In an
experiment using 565 examples in the training set, we obtain the contingency table
for the attribute left_W that is shown in Table 1.



Table 1. A naive Bayes contingency table

Attribute: left_W (Discretized)

Class     I1      I2      I3      I4      I5
L        14.0    42.0    61.0    71.0    72.0
B        10.0     8.0     8.0    10.0     9.0
R        86.0    66.0    49.0    34.0    25.0



After building the contingency tables from the training examples, suppose
that we want to classify the following example:

left_W: 1, left_D: 5, right_W: 4, right_D: 2, Class: R

The output of the naive Bayes classifier will be something like:

Observed R Classified R [ 0.277796 0.135227 0.586978 ]

It says that a test example that is observed to belong to class R is classified
correctly. The following numbers are the probabilities that the example belongs
to each of the classes. Because the probability $P(R \mid x)$ is the greatest, the
example is classified as class R. Although the classification is correct, the
confidence in this prediction is low (59%). Moreover, taking into account that
the example belongs to the training set, the answer, although correct, does not
seem to fully exploit the information in the training set.

This is the problem that we want to address in this paper: can we improve
the confidence levels of the predictions of naive Bayes, without degrading its
performance? The method that we propose begins with the contingency tables
built by the standard naive Bayes scheme. This is followed by an iterative
procedure that updates the contingency tables. We iteratively cycle through all
the training examples. For each example, the corresponding entries in the
contingency tables are updated in order to increase the confidence in the correct
class. Consider again the previous training example. The value of the attribute
left_W is 1. This means that the values in column I1 of Table 1 are used to
compute the probabilities of equation (3). The desirable update will increase the
probability $P(R \mid x)$ and consequently decrease both $P(L \mid x)$ and
$P(B \mid x)$. This could be done by increasing the contents of the cell (I1; R)
and decreasing the other entries in column I1. The same occurs for all the
attribute-values of an example. This is the intuition behind the update scheme
that we follow. Also, the amount of correction should be proportional to the
difference $1 - P(C_i \mid x)$. The contingency table for the attribute left_W
after the iterative procedure is given in Table 2.

Table 2. A naive Bayes contingency table after the iteration procedure

Attribute: left_W (Discretized)

Class      I1       I2      I3      I4      I5
L          7.06    42.51   75.98   92.26   96.70
B          1.06     1.0     1.0     1.0     1.08
R        105.64    92.29   62.63   37.01   20.89

Now the same example, classified using the contingency tables after the
iteration procedure, gives:

Observed R Classified R [ 0.210816 0.000175 0.789009 ] 



The classification is the same, but the confidence level of the predicted class
increases while the confidence level of the other classes decreases. This is the
desirable behavior.

The iterative procedure uses a hill-climbing algorithm. At each iteration, all
the examples in the training set are classified using the current contingency
tables. The current set of contingency tables is evaluated using equation (4):

$$\frac{1}{n} \sum_{i=1}^{n} \left( 1.0 - \max_j \, p(C_j \mid x_i) \right) \qquad (4)$$

where $n$ represents the number of examples and $j$ ranges over the classes. The
iterative procedure continues while the evaluation function decreases, up to a
maximum of 10 iterations.
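As a small worked illustration of the evaluation function, the sketch below averages one minus the largest class probability over a list of predictions. The function name `evaluation` is an assumption; the two probability vectors are the ones quoted in the text for the same example before and after the iterative procedure.

```python
def evaluation(posteriors):
    """Mean over examples of 1.0 - max_j p(C_j | x_i); lower means more confident."""
    return sum(1.0 - max(p) for p in posteriors) / len(posteriors)


before = [0.277796, 0.135227, 0.586978]   # naive Bayes output quoted in the text
after = [0.210816, 0.000175, 0.789009]    # output after the iterative procedure

print(evaluation([before]), evaluation([after]))
```

For this single example the value drops from about 0.41 to about 0.21, which is the direction the hill-climbing procedure is looking for.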

The pseudo-code for the update function is shown in Figure 1. To update the
contingency tables, we use the following heuristics:

1. The value of each cell never goes below 1.0.

2. If an example is correctly classified then the increment is positive, otherwise
it is negative. To compute the value of the increment we use the following
heuristic: $(1.0 - P(\mathit{Predict} \mid x)) / \#\mathit{Classes}$. That is, the
increment is a function of the confidence in predicting class Predict and of the
number of classes.

3. For all attribute-values observed in the given example, the increment is added
to all the entries for the predicted class and half of the increment is subtracted
from the entries of all the other classes.

The contingency tables are incrementally updated each time a training example
is seen. This implies that the order of the training examples could influence
the final results. This set of rules guarantees that after one example is seen, the
update scheme will increase the probability of the correct class. However, there
is no guarantee of improvement for a set of examples.
Input: A contingency Table, a given Example, predicted class, increment

Procedure Update(Table, Example, predicted, increment)
  For each Attribute
    For each Class
      If (Class == predicted)
        Table(Attribute, Class, Example(Attribute value)) += increment
      Else
        Table(Attribute, Class, Example(Attribute value)) -= increment
      EndIf
    Next Class
  Next Attribute
End

Fig. 1. Pseudo-code for the Update function. The increment is a positive real
number if the example is correctly classified and a negative number otherwise.
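The update step can be sketched in Python as below. This is an illustration under stated assumptions rather than the authors' code: the nested-dictionary layout `tables[j][value][cls]` and the function names are hypothetical, and the sketch follows the three heuristics in the text (floor of 1.0, half of the increment subtracted from the other classes), whereas the pseudo-code of Fig. 1 subtracts the full increment.

```python
def increment_for(posterior, predicted, observed):
    """Heuristic 2: magnitude (1 - P(predicted | x)) / #classes, signed by correctness."""
    delta = (1.0 - posterior[predicted]) / len(posterior)
    return delta if predicted == observed else -delta


def update(tables, example, predicted, increment, floor=1.0):
    """Adjust the cells selected by the example's attribute-values, in place."""
    for j, value in enumerate(example):
        column = tables[j][value]          # class -> cell value for this attribute-value
        for cls in column:
            if cls == predicted:
                column[cls] += increment
            else:
                column[cls] -= increment / 2.0        # heuristic 3
            column[cls] = max(floor, column[cls])     # heuristic 1


# Column I1 of Table 1 for attribute left_W (the only table given in the text).
tables = [{1: {"L": 14.0, "B": 10.0, "R": 86.0}}]
posterior = {"L": 0.277796, "B": 0.135227, "R": 0.586978}
inc = increment_for(posterior, predicted="R", observed="R")
update(tables, example=[1], predicted="R", increment=inc)
print(inc, tables[0][1])
```

One pass over this correctly classified example raises the cell (I1; R) and lowers the other two cells in the same column, as described in the text.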



3 Related Work 

The work of Webb and Pazzani [?] clearly illustrates the benefits of adjusting
the priors of a class. In their work a numeric weight is inferred for each class.
During classification, the naive Bayes probability of a class is multiplied by its
weight to obtain an adjusted value. This process has been shown to significantly
improve the predictive accuracy.

Our method, instead of adjusting the priors of a class, adjusts the term
$P(x_j \mid C_i)$. In our perspective this has similarities with the process of
computing the weights in a linear machine [?]. Defining one Boolean attribute for
each value of an attribute and applying logarithms to equation (3) we obtain¹:

$$\log P(C_i \mid x) = \log P(C_i) + \sum_{j} \log P(x_j \mid C_i) \qquad (5)$$

This equation shows that naive Bayes is formally equivalent to a linear
machine [?]. Training a linear machine is an optimization problem that has been
extensively studied, for example in the neural network community [?] and in
machine learning [?]. In this section we review the work done in the machine
learning community.
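The equivalence can be made concrete with a short sketch: each class gets a bias $\log P(C_i)$ and one weight $\log P(x_j = v \mid C_i)$ per Boolean attribute "attribute j takes value v". The function names and the probability values below are made up for illustration, and all probabilities are assumed to be strictly positive so that the logarithms exist.

```python
import math


def linear_machine_from_tables(priors, cond):
    """priors[c] = P(c); cond[c][j][v] = P(x_j = v | c).

    Returns, per class, (bias, weights) with weights[(j, v)] = log P(x_j = v | c).
    """
    machine = {}
    for c in priors:
        weights = {(j, v): math.log(p)
                   for j, table in enumerate(cond[c])
                   for v, p in table.items()}
        machine[c] = (math.log(priors[c]), weights)
    return machine


def scores(machine, x):
    """Linear discriminant: bias plus the weights of the active Boolean attributes."""
    return {c: bias + sum(w[(j, v)] for j, v in enumerate(x))
            for c, (bias, w) in machine.items()}


priors = {"A": 0.5, "B": 0.5}
cond = {"A": [{"u": 0.8, "v": 0.2}, {"s": 0.5, "t": 0.5}],
        "B": [{"u": 0.3, "v": 0.7}, {"s": 0.9, "t": 0.1}]}
s = scores(linear_machine_from_tables(priors, cond), ("u", "s"))
print(s, max(s, key=s.get))
```

Each score equals the logarithm of the corresponding naive Bayes product, so choosing the class with the largest score reproduces the naive Bayes decision.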

¹ Ignoring the term $P(x)$ since it is the same for all classes.

The Absolute Error Correction Rule [?] is the most common method used to
determine the coefficients for a linear machine. This is an incremental algorithm
that iteratively cycles through all the instances. At each iteration each example
is classified using the current set of weights. If the example is misclassified then
the weights are updated. Supposing that the example belongs to class $i$ and is
classified as $j$ with $i \neq j$, then the weights $W_i$ and $W_j$ must be
corrected. The correction is accomplished by $W_i \leftarrow W_i + kY$ and
$W_j \leftarrow W_j - kY$, where the correction

$$k = \frac{(W_j - W_i)^T Y + \epsilon}{2\, Y^T Y} \qquad (6)$$

causes the updated linear machine to classify the instance correctly ($\epsilon$ is
a small constant). If the instances are linearly separable, then cycling through
the instances allows the linear machine to partition the instances into separate
convex regions. If the instances are not linearly separable, then the error
corrections will not cease, and the classification accuracy of the linear machine
will be unpredictable. To deal with this situation, two variants often referred to
in the literature are:



1. Pocket Algorithm,

which maximizes the number of correct classifications on the training data.
It stores in P the best weight vector W, as measured by the longest run
of consecutive correct classifications, called the pocket count. An LTU based
on the pocket vector P is optimal in the sense that no other weight vector
visited so far is likely to be a more accurate classifier.

2. Thermal Training,

which converges to a set of coefficients by paying decreasing attention to large
errors. This is done by using a correction factor $c$ that depends on $\beta$ and
$k$, where $\beta$ is annealed during training and $k$ is given by equation (6).
The training algorithm repeatedly presents examples at each node until the linear
machine converges.
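One step of the absolute error correction rule, on which both variants build, can be sketched as follows. The function name, the dictionary of weight vectors and the default value of `eps` are assumptions, and so is the placement of the small constant in the numerator of the correction, since the printed formula is not fully legible in the source.

```python
def absolute_correction(weights, y, true_cls, eps=1e-3):
    """Classify y with the linear machine; correct two weight vectors on a mistake.

    weights[c] is the weight vector of class c; y must be a non-zero vector.
    Returns the class predicted before any correction.
    """
    def dot(a, b):
        return sum(u * v for u, v in zip(a, b))

    predicted = max(weights, key=lambda c: dot(weights[c], y))
    if predicted == true_cls:
        return predicted
    w_i, w_j = weights[true_cls], weights[predicted]
    diff = [b - a for a, b in zip(w_i, w_j)]                    # W_j - W_i
    k = (dot(diff, y) + eps) / (2.0 * dot(y, y))                # correction k
    weights[true_cls] = [a + k * v for a, v in zip(w_i, y)]     # W_i <- W_i + kY
    weights[predicted] = [b - k * v for b, v in zip(w_j, y)]    # W_j <- W_j - kY
    return predicted


w = {"a": [0.0, 0.0], "b": [1.0, 0.0]}
print(absolute_correction(w, [1.0, 1.0], "a"))   # misclassified, weights corrected
print(absolute_correction(w, [1.0, 1.0], "a"))   # now classified as "a"
```

After the correction the true class scores higher than the previously predicted one by exactly `eps` on this instance; with more than two classes a third class could still win, which is one reason the rule is applied repeatedly.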

Also the logistic discriminant [?] optimizes the coefficients of the linear
machine using gradient descent. The algorithm maximizes a conditional likelihood
that is given by:

$$L(\beta_1, \dots, \beta_{q-1}) = \prod_{x \in C_1} P(C_1 \mid x) \prod_{x \in C_2} P(C_2 \mid x) \cdots \prod_{x \in C_q} P(C_q \mid x) \qquad (7)$$

The vector of coefficients is updated only after all the instances have been visited.



4 Empirical Evaluation 

We have compared a standard naive Bayes against our Iterative proposal on
twenty-five benchmark datasets from UCI [?].

For each dataset we estimate the error rate of each algorithm using a 10-fold
stratified cross-validation. We repeat this process 10 times, each time using
a different permutation of the dataset. The final estimate of the error is the
average of the ten iterations.

The empirical evaluation shows that the proposed method not only improves
the confidence level on 25 benchmark datasets from UCI but also improves the
global error rate. Results have been compared using paired t-tests. The confidence
level was set to 99%. A + (−) sign means that Iterative Bayes obtains a
better (worse) result with statistical significance. On eleven datasets Iterative
Bayes produces better results and it only loses on one dataset. A summary of
comparative statistics is presented in Table 4.
Table 3. Comparison naive Bayes versus Iterative Bayes

Dataset        naive Bayes          Iterative Bayes
Australian     14.27 ±4.4           14.25 ±4.3
Balance         8.57 ±1.2    +       7.92 ±0.6
Banding        22.95 ±7.5    +      21.23 ±7.2
Breast (W)      2.69 ±1.9            2.69 ±1.9
Cleveland      16.65 ±6.3           16.67 ±6.2
Credit         14.40 ±4.4           14.17 ±4.0
Diabetes       23.88 ±4.6           23.31 ±4.4
German         24.28 ±3.9           24.27 ±3.8
Glass          35.85 ±10.2          36.86 ±9.9
Heart          16.11 ±6.3           15.67 ±6.6
Hepatitis      17.59 ±9.6           16.93 ±9.7
Ionosphere     11.45 ±5.7    +       9.25 ±4.7
Iris            4.27 ±5.1    +       2.80 ±3.9
Letter         40.43 ±1.0    -      41.10 ±1.0
Monks-1        25.01 ±4.6           25.01 ±4.6
Monks-2        34.42 ±2.5    +      32.86 ±0.6
Monks-3         2.77 ±2.3            2.77 ±2.3
Mushroom        3.37 ±0.7            2.65 ±3.6
Satimage       19.05 ±1.5    +      18.79 ±1.5
Segment        10.19 ±1.8    +       9.71 ±1.8
Sonar          25.84 ±9.0    +      23.74 ±8.5
Vehicle        38.68 ±4.2    +      36.66 ±4.6
Votes           9.83 ±4.4    +       8.24 ±4.8
Waveform       18.87 ±2.0    +      16.69 ±2.0
Wine            1.97 ±3.0            2.08 ±3.2



4.1 Bias-Variance Decomposition

The Bias-Variance decomposition of the error rate [?] is a useful tool to
understand the behavior of learning algorithms. Using the decomposition proposed in
[?] we have analyzed the error decomposition of naive Bayes and Iterative Bayes
on the datasets under study. To estimate the bias and variance, we first split the
data into training and evaluation sets. From the training set we obtain ten
bootstrap replications used to build ten classifiers. We ran the learning algorithm
on each of the training sets and estimated the terms of the variance and bias using
the generated classifier for each point x in the evaluation set. All the terms were
estimated using frequency counts.

Figure 2 presents, for each dataset, a comparison between the algorithms. On
most datasets, we verify that Iterative Bayes is able to reduce, in comparison
with naive Bayes, both components. Table 5 presents an average summary of the
results.
Table 4. Summary of results

                                   naive Bayes   Iterative Bayes
Average Mean                          17.74          17.05
Geometric Mean                        13.28          12.46
Nr. of wins                            4             18
Nr. of Significant wins (t-test)       1             11
Average Ranking                        1.8            1.2

Fig. 2. Bias-Variance decomposition. For each dataset it is shown for Bayesit
and naive Bayes, in this order.



4.2 Learning Times 

Iterative Bayes is necessarily slower than naive Bayes. Figure 3 shows the
cumulative learning times for the experiments reported in this paper. This figure
suggests that Iterative Bayes has the same time complexity as naive Bayes,
although with a greater slope.

4.3 The Attribute Redundancy Problem

Naive Bayes is known to be optimal when attributes are independent given
the class. To gain insight into the influence of attribute dependencies on the
behavior of Iterative Bayes, we have performed the following experiment. We
ran naive Bayes and Iterative Bayes on the original Balance-scale dataset.
We obtained a redundant attribute by duplicating one attribute. We reran naive
Bayes and Iterative Bayes on the new dataset. Table 6 presents the results
of duplicating one, two, and three attributes. The results were obtained using
10-fold cross-validation.
Table 5. Bias-Variance decomposition

                   Bias    Variance
naive Bayes        15.6      1.7
Iterative Bayes    14.7      1.6

[Plot: cumulative times comparison; legend entry: Bayesit; horizontal axis: Nr. Datasets; logarithmic vertical axis]

Fig. 3. Learning times comparison (logscale)



We can observe that while the error rate of the naive Bayes doubles, the 
Iterative Bayes maintains the same error rate. While naive Bayes strongly de- 
grades its performance in the presence of a redundant attribute the Iterative 
Bayes is not affected. These results indicate that Iterative Bayes could be a 
method to reduce the well known bottleneck of attribute dependences in naive 
Bayes learning. 



Table 6. A case study on redundant attributes

       Original Data       1 Att. dup          2 Att. dup          3 Att. dup
 cv    Bayes  It. Bayes    Bayes  It. Bayes    Bayes  It. Bayes    Bayes  It. Bayes
  1     8.95    7.84       16.96    7.84       17.26    7.99       19.80    8.31
  2     8.71    7.99       17.04    7.99       17.60    8.15       20.15    8.39
  3     8.68    7.94       16.85    7.94       17.34    8.10       19.99    8.42
  4     8.63    7.91       17.05    7.92       17.44    8.03       20.16    8.36
  5     8.66    7.90       17.09    7.90       17.38    8.03       20.09    8.31
  6     8.63    7.92       17.09    7.92       17.39    8.05       20.16    8.34
  7     8.66    7.90       17.09    7.90       17.42    8.02       20.13    8.29
  8     8.65    7.90       17.14    7.92       17.44    8.01       20.22    8.23
  9     8.60    7.91       17.10    7.92       17.43    8.01       20.11    8.21
 10     8.57    7.92       17.10    7.93       17.39    7.99       20.14    8.20


5 Conclusions and Future Work

In this paper we have presented an iterative approach to naive Bayes. Iterative
Bayes begins with the distribution tables built by naive Bayes, followed
by an optimization process. The optimization process consists of an iterative
update of the contingency tables in order to improve the class probability
distribution associated with each training example. Iterative Bayes uses exactly
the same representational language as naive Bayes. As such, both models have the
same degree of interpretability. Experimental evaluation of Iterative Bayes on
25 benchmark datasets shows minor but consistent gains in accuracy. On eleven
datasets statistically significant gains were obtained, and on only one dataset a
significant degradation was observed. An interesting side effect of our algorithm
is that it is less sensitive to attribute dependences.

Future research will explore different update rules. Moreover, the robustness
to redundant attributes requires a closer examination. We intend to
perform a more extensive study of this issue.
Abstract. The paper describes causal law mining from an incomplete
database. First we extend the definition of association rules in
order to deal with uncertain attribute values in records. As Agrawal’s
well-known algorithm generates too many irrelevant association rules, a
filtering technique based on the minimal AIC principle is applied here. The
graphic representation of association rules validated by a filter may have
directed cycles. The authors propose a method to exclude useless rules
with a stochastic test, and to construct Bayesian networks from the
remaining rules. Finally, a scheme for Causal Law Mining is proposed as an
integration of the techniques described in the paper.



1 Introduction 

Acquisition of causal laws is an important new research field. When we try to
discover causal laws from a database, we often encounter situations where the
information is not complete. The ability to infer unknown attribute values is
essential to obtain causal laws.

A Bayesian network (Bayesian belief network) is a graph representation of
uncertain knowledge and has the ability to infer unknown values. The nodes and
arcs in Bayesian networks correspond to causal relationships. Bayesian networks
are, however, difficult to construct automatically from a large amount of observed
data, because of the combinatorial explosion of candidate network structures. Thus,
we need an efficient method to determine co-occurrence relations, which are
candidates for causal relations.

To find co-occurrences, we can use a well-known algorithm [?] to generate
association rules [2], each of which defines the co-occurrence relation between
the left hand side (LHS) and right hand side (RHS). However, there are two
problems that remain unsolved for the purpose of causal law mining.

The first problem is that the algorithm produces many rules that are irrelevant
to causal relations. It is necessary to filter out irrelevant rules with a
stochastic test. It is also necessary to adjust the thresholds which decide the
minimum degree of rule confidence and support, before applying the algorithm.

The second problem is how to generate Bayesian networks from association
rules after the filtering. Although it is possible to infer unknown attributes by
means of Bayesian networks, Bayesian networks have the constraint that directed
cycles are not allowed. As generated association rules may contain directed
cycles, it is necessary to exclude useless rules in directed cycles.

To solve the first problem, the authors introduce two types of rule filters [?]
which exclude irrelevant rules by means of a dependency check of RHS versus
LHS. When we check the dependency, we assume a dependent model and an
independent model for each target rule, and compare these two models with
Akaike’s Information Criterion (AIC) [?]. By this method, we can filter out
irrelevant association rules without adjusting thresholds.

Our suggestion for the second problem is as follows. We set the cost of arcs
in a graphic representation of validated rules. The arc cost is the AIC value of
a dependent model corresponding to the arc. We find a set of arcs which break
directed cycles, such that the total cost of the arcs is minimal. After excluding
the set of arcs, we can obtain an optimal acyclic network structure.

Finally, we propose a process of causal law mining, by integrating the
above-mentioned techniques.



2 Data Model 

We limit ourselves to observed data for cases which can be represented as a set of 
attributes over a domain {T, F,U}. Here, U stands for undetermined (intuitively: 
either true or false), T for true, F for false. Readers may suppose this limitation 
is inconvenient to deal with real world applications. Note however, that the 
limitation is just a property of our specific method of modeling observed data. 
By a natural extension, we can deal with an attribute whose value is a set of 
any number of symbols. 

In this paper, truth assignment for all observed data is represented as a single 
truth assignment table. The descriptive ability of the table is more general, and 
we now regard the value of an attribute as the probability that the attribute is 
true, rather than a symbol in {T, F,U}. If we find the value of a specific attribute 
in a given case is T (or F), the attribute value in the case is 1.0 (0.0). When the
value of an attribute is U, the attribute value is greater than 0.0 and less than
1.0.

To describe a truth assignment table, we use the symbols in Table 1. 



c_1, ..., c_N              cases in observed data
x_1, ..., x_K              attributes of a case
v_i(x_1), ..., v_i(x_K)    attribute values in case c_i
i                          indexes instances, i = 1, ..., N
j                          indexes attributes, j = 1, ..., K

Table 1. Symbols used to describe observed data.
3 Association Rule Generation

3.1 Semantics of Association Rules

Agrawal et al. [?] introduced a class of production rules, called association rules,
and gave fast algorithms for finding such rules [?].

In this paper, the semantics of the association rule is slightly modified from
that of the original research, in order to handle negation and missing data.
The form of an association rule in this paper is as follows:

$$R: \; p_1 \wedge p_2 \wedge \dots \wedge p_m \Rightarrow q$$

where $p_j, q \in \{x_1, \neg x_1, \dots, x_K, \neg x_K\}$, $(1 \le j \le m \le K)$.

Each rule has two property values: support and confidence value, which are
denoted by $Sup(R)$ and $Conf(R)$ respectively.

The meaning of support is almost the same as the original. The way of calculating
$Sup(R)$ is a natural extension of the original.

$$Sup(R) = \sum_{c_i} \Big( \prod_{j=1}^{m} v_i(p_j) \cdot v_i(q) \Big)$$
The meaning of confidence value is quite different from the original confidence.
$Conf(R)$ is the probability that the RHS of $R$ holds when the LHS of $R$ is satisfied.

$$Conf(R) = \frac{\sum_{c_i} \Big( \prod_{j=1}^{m} v_i(p_j) \cdot v_i(q) \Big)}{\sum_{c_i} \Big( \prod_{j=1}^{m} v_i(p_j) \Big)}$$

The above definition needs a little more computational effort to produce
association rules, but does not affect the efficiency of the original generation
algorithms.
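The extended definitions can be illustrated with a short sketch over a truth assignment table whose entries are probabilities. The function names and the `"~x1"` syntax for a negated literal are hypothetical, the value of a negated literal is assumed to be one minus the value of the attribute, and support is computed as the un-normalised sum (divide by the number of cases if a relative frequency is wanted).

```python
def literal_value(case, literal):
    """case maps attribute name -> probability that the attribute is true."""
    if literal.startswith("~"):
        return 1.0 - case[literal[1:]]      # assumption: v(not x) = 1 - v(x)
    return case[literal]


def lhs_value(case, lhs):
    """Product of the values of the LHS literals p_1 ... p_m in one case."""
    prod = 1.0
    for p in lhs:
        prod *= literal_value(case, p)
    return prod


def support(cases, lhs, rhs):
    return sum(lhs_value(c, lhs) * literal_value(c, rhs) for c in cases)


def confidence(cases, lhs, rhs):
    denom = sum(lhs_value(c, lhs) for c in cases)
    return support(cases, lhs, rhs) / denom if denom > 0 else 0.0


# Four cases; the third has an undetermined x1, represented by the value 0.5.
cases = [{"x1": 1.0, "x2": 1.0}, {"x1": 1.0, "x2": 0.0},
         {"x1": 0.5, "x2": 1.0}, {"x1": 0.0, "x2": 1.0}]
print(support(cases, ["x1"], "x2"), confidence(cases, ["x1"], "x2"))
```

When every value is 0.0 or 1.0 the two functions reduce to ordinary co-occurrence counting, which is why the generation algorithms themselves are unaffected.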



3.2 Rule Filtering Based on AIC 

Agrawal’s algorithms tend to produce many inadequate association rules because 
the purpose of the algorithms is to discover all rules rather than to verify whether 
a particular rule holds. One method to reduce the number of generated rules is 
to raise the level of minconf in the algorithms. However, the higher minconf 
becomes, the less chance there is to discover the rules. Our suggestion for this 
over-generation is to filter out rules with the two types of stochastic tests. 

The stochastic tests proposed here are modifications of the method used in [?].

In the paper [?], the filtering method is evaluated with an artificial database
generated by a telephone call simulator. The simulator generates two types of
call attempts. One is an ordinary call attempt which occurs randomly. The
other type is a call-relay which is used by fraudsters. Agrawal’s algorithms
generate about 30 million rules from the database, but the number of rules
remaining after the filtering is about 2,500. After the filtering, no false rules
are included in the remaining rules, and 70% of the correct rules are included.

The two types of filtering method proposed here are as follows.

Type I Filter 

This filter validates whether the consequence of a target rule (or RHS) depends
on the condition of the rule (or LHS). This dependency is examined by a
2×2 contingency table such as Figure 1 (a). In Figure 1 (a), B denotes the
condition and H denotes the consequence. $n_{11}$ represents the occurrence
frequency where both B and H hold, and $n_{12}$ represents the frequency where B
holds but H does not. $n_{21}$ and $n_{22}$ are defined in the same way.





          H      ¬H
  B      n11    n12
 ¬B      n21    n22
        (a)

          H      ¬H
  B      p11    p12
 ¬B      p21    p22
        (b)

          H          ¬H
  B      pq         p(1-q)
 ¬B      (1-p)q     (1-p)(1-q)
        (c)

Fig. 1. (a) contingency table for a target rule, (b) probabilistic distribution for
a dependent model for B => H (3 free parameters), (c) probabilistic distribution
for an independent model (2 free parameters).



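As we extend the assignment of an unknown attribute value, we now introduce
new semantics for this contingency table. The value of each cell $n_{11}$,
$n_{12}$, $n_{21}$, $n_{22}$ for rule $R$ is defined as:

$$n_{11} = \sum_{c_i} \Big( \prod_{j=1}^{m} v_i(p_j) \cdot v_i(q) \Big)$$

$$n_{12} = \sum_{c_i} \Big( \prod_{j=1}^{m} v_i(p_j) \cdot (1 - v_i(q)) \Big)$$

$$n_{21} = \sum_{c_i} \Big( \prod_{j=1}^{m} (1 - v_i(p_j)) \cdot v_i(q) \Big)$$

$$n_{22} = \sum_{c_i} \Big( \prod_{j=1}^{m} (1 - v_i(p_j)) \cdot (1 - v_i(q)) \Big)$$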

The $\chi^2$-test is a popular means of checking the interdependency of items in a
contingency table. Zembowicz et al. [?] use a variation of the $\chi^2$-test called
Cramer’s V coefficient in order to discover equivalence relations in a database.
But these methods, such as the $\chi^2$-test, for a contingency table need some
specific thresholds. It is very hard to give proper values to these thresholds
without trials. This causes much inconvenience because a mining process usually
takes much time.

A statistical test which needs no thresholds is as follows: 

We should first prepare two kinds of models.

The first model, called a dependent model (DM), assumes that there is a
dependency relation between B and H. This model is represented in Figure 1
(b), using probabilistic distribution parameters. In this model, 3 free parameters
are needed to represent any probabilistic distribution, because
$p_{11} + p_{12} + p_{21} + p_{22} = 1$.

The second model, called an independent model (IM), assumes that there are
no dependency relations between B and H. Figure 1 (c) shows the probabilistic
distribution of this model, which needs two free parameters.

The AIC is a scoring metric for model selection, defined as below:

$$AIC = -2 \times MLL + 2 \times M$$

where $MLL$ is the maximum log-likelihood, and $M$ is the number of free
parameters.

In the case of DM and IM, the calculation is as below:

$$MLL(DM) = \sum_{i,j} n_{ij} \log n_{ij} - N \log N$$

$$AIC(DM) = -2 \times MLL(DM) + 2 \times 3$$

$$MLL(IM) = h \log h + k \log k + (N - h) \log(N - h) + (N - k) \log(N - k) - 2N \log N$$

$$AIC(IM) = -2 \times MLL(IM) + 2 \times 2$$

where $h = n_{11} + n_{12}$, $k = n_{11} + n_{21}$, $N = \sum_{i,j} n_{ij}$.



If $AIC(DM)$ is less than $AIC(IM)$, then DM is the better model. Otherwise
IM is the better model, and we consider B and H as having no dependency relation.
This means that, if $AIC(IM)$ is less than $AIC(DM)$, then the following rules
should be excluded.

$$\{B \Rightarrow H,\; B \Rightarrow \neg H,\; \neg B \Rightarrow H,\; \neg B \Rightarrow \neg H\}$$

If B consists of a single attribute, the following rules are also excluded.

$$\{H \Rightarrow B,\; H \Rightarrow \neg B,\; \neg H \Rightarrow B,\; \neg H \Rightarrow \neg B\}$$

Figure 2 (a)-(c) are examples of contingency tables, and (d) is the AIC of
both models for (a)-(c). In these examples, our filter excludes (b) and (c), and
only (a) remains. After this exclusion, we should test the direction of causality
for (a), because the previous test only investigates the existence of a dependency
relation between B and H.









          x_j    ¬x_j
  x_i      80     20
 ¬x_i     600    300
        (a)

          x_j    ¬x_j
  x_i      70     30
 ¬x_i     600    300
        (b)

          x_j    ¬x_j
  x_i       8      2
 ¬x_i      60     30
        (c)

        AIC(DM)    AIC(IM)    excluded?
 (a)    1901.97    1907.90    No
 (b)    1924.06    1922.52    Yes
 (c)     195.60     194.39    Yes
        (d)

Fig. 2. (a),(b),(c) examples of contingency tables, (d) result of AIC for two
models. (2 free parameters).
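The Type I test can be checked numerically with a short sketch. The function names are hypothetical; natural logarithms are assumed, which reproduces the values reported in Fig. 2 (d), and a zero count is assumed to contribute nothing to the log-likelihood.

```python
import math


def xlogx(n):
    """n * log(n), taking 0 * log(0) as 0."""
    return n * math.log(n) if n > 0 else 0.0


def aic_dependent(n11, n12, n21, n22):
    total = n11 + n12 + n21 + n22
    mll = xlogx(n11) + xlogx(n12) + xlogx(n21) + xlogx(n22) - xlogx(total)
    return -2.0 * mll + 2 * 3            # 3 free parameters


def aic_independent(n11, n12, n21, n22):
    total = n11 + n12 + n21 + n22
    h, k = n11 + n12, n11 + n21          # row and column totals for B and H
    mll = (xlogx(h) + xlogx(k) + xlogx(total - h) + xlogx(total - k)
           - 2.0 * xlogx(total))
    return -2.0 * mll + 2 * 2            # 2 free parameters


def excluded(n11, n12, n21, n22):
    """A rule is excluded unless the dependent model has the strictly smaller AIC."""
    return aic_dependent(n11, n12, n21, n22) >= aic_independent(n11, n12, n21, n22)


for name, cells in [("a", (80, 20, 600, 300)),
                    ("b", (70, 30, 600, 300)),
                    ("c", (8, 2, 60, 30))]:
    print(name, round(aic_dependent(*cells), 2),
          round(aic_independent(*cells), 2), excluded(*cells))
```

Tables (a) and (c) have the same proportions but differ in size by a factor of ten, which shows how the same apparent dependency is accepted with 1000 cases and rejected with 100.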



Causality direction is tested based on $Conf(R)$. From the table in Figure
2 (a), we find eight candidates for causal laws, listed in Table 2. In Table 2,
rule $x_i \Rightarrow x_j$ is incompatible with rule $x_j \Rightarrow x_i$, because
$x_j$ should not affect $x_i$ when $x_i$ affects $x_j$. Four rules,
$x_j \Rightarrow x_i$, $\neg x_j \Rightarrow x_i$, $\neg x_i \Rightarrow x_j$ and
$\neg x_i \Rightarrow \neg x_j$, are excluded, because the confidence value of each
excluded rule is less than that of the incompatible rule.
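The direction test can be sketched by pairing each rule with its reversed counterpart and keeping the one with the higher confidence. The function name and the string labels are hypothetical, ties are resolved in favour of the forward rule as an assumption, and all row and column totals are assumed to be non-zero.

```python
def direction_filter(n11, n12, n21, n22):
    """Rows are x_i / ~x_i, columns are x_j / ~x_j, as in Fig. 2 (a).

    Returns the four (lhs, rhs) rules kept after comparing each rule with the
    incompatible rule obtained by swapping its two sides.
    """
    row = {"xi": n11 + n12, "~xi": n21 + n22}
    col = {"xj": n11 + n21, "~xj": n12 + n22}
    cell = {("xi", "xj"): n11, ("xi", "~xj"): n12,
            ("~xi", "xj"): n21, ("~xi", "~xj"): n22}
    kept = []
    for (a, b), n in cell.items():
        forward = n / row[a]       # Conf(a => b)
        backward = n / col[b]      # Conf(b => a)
        kept.append((a, b) if forward >= backward else (b, a))
    return kept


print(direction_filter(80, 20, 600, 300))
```

For the counts of Fig. 2 (a) the kept rules are exactly the four that are not listed as excluded in the text.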

Type II Filter 

This filter validates whether an additional part of the LHS contributes to
raising the confidence value of a target rule. This statistical test is also
based on a model selection method with AIC.

Figure 3 (a) shows the contingency table for this test, and Figure 3 (b) and (c)
are the probability distributions of DM and IM respectively. AIC is calculated
as follows:

$\mathrm{Conf}(x_i \Rightarrow x_j) = 0.8$              $\mathrm{Conf}(x_j \Rightarrow x_i) = 0.12$
$\mathrm{Conf}(x_i \Rightarrow \neg x_j) = 0.2$         $\mathrm{Conf}(\neg x_j \Rightarrow x_i) = 0.06$
$\mathrm{Conf}(\neg x_i \Rightarrow x_j) = 0.67$        $\mathrm{Conf}(x_j \Rightarrow \neg x_i) = 0.88$
$\mathrm{Conf}(\neg x_i \Rightarrow \neg x_j) = 0.33$   $\mathrm{Conf}(\neg x_j \Rightarrow \neg x_i) = 0.94$

Table 2. Candidates for Figure 2 (a)



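$$\mathrm{MLL}(DM) = \sum_{i,j} n_{ij} \log n_{ij} - N \log N$$

$$\mathrm{AIC}(DM) = -2 \times \mathrm{MLL}(DM) + 2 \times 7$$

$$\mathrm{MLL}(IM) = h \log h + k \log k + l \log l + t \log t + u \log u + (N - u) \log (N - u) - 2N \log N$$

$$\mathrm{AIC}(IM) = -2 \times \mathrm{MLL}(IM) + 2 \times 4$$

where $h = n_{11} + n_{21}$, $k = n_{12} + n_{22}$, $l = n_{31} + n_{41}$, $t = n_{32} + n_{42}$,
$u = n_{11} + n_{12} + n_{31} + n_{32}$, and $N = \sum_{i,j} n_{ij}$.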





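(a)
                           $H$         $\neg H$
$B \wedge B'$              $n_{11}$    $n_{12}$
$B \wedge \neg B'$         $n_{21}$    $n_{22}$
$\neg B \wedge B'$         $n_{31}$    $n_{32}$
$\neg B \wedge \neg B'$    $n_{41}$    $n_{42}$

(b)
                           $H$         $\neg H$
$B \wedge B'$              $s_{11}$    $s_{12}$
$B \wedge \neg B'$         $s_{21}$    $s_{22}$
$\neg B \wedge B'$         $s_{31}$    $s_{32}$
$\neg B \wedge \neg B'$    $s_{41}$    $s_{42}$

(c)
                           $H$              $\neg H$
$B \wedge B'$              $p_{11} r$       $p_{12} r$
$B \wedge \neg B'$         $p_{11}(1-r)$    $p_{12}(1-r)$
$\neg B \wedge B'$         $p_{21} r$       $p_{22} r$
$\neg B \wedge \neg B'$    $p_{21}(1-r)$    $p_{22}(1-r)$

Fig. 3. (a) contingency table for a target rule, (b) probabilistic distribution for a
dependent model for $B \wedge B' \Rightarrow H$ (7 free parameters), (c) probabilistic
distribution for an independent model (3 free parameters).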
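The Type II computation can be sketched in the same style. This is a minimal sketch, assuming natural logarithms, the convention $0 \log 0 = 0$, and an independent model in which $B'$ is independent of the pair $(B, H)$; the name `aic_type2` is illustrative.

```python
import math


def _xlogx(x):
    # Convention 0 * log 0 = 0.
    return x * math.log(x) if x > 0 else 0.0


def aic_type2(table):
    """Return (AIC(DM), AIC(IM)) for a 4x2 contingency table.

    Rows: B&B', B&~B', ~B&B', ~B&~B'.  Columns: H, ~H.
    """
    (n11, n12), (n21, n22), (n31, n32), (n41, n42) = table
    total = sum(sum(row) for row in table)            # N
    mll_dm = sum(_xlogx(c) for row in table for c in row) - _xlogx(total)
    h, k = n11 + n21, n12 + n22   # counts for (B, H) and (B, ~H)
    l, t = n31 + n41, n32 + n42   # counts for (~B, H) and (~B, ~H)
    u = n11 + n12 + n31 + n32     # cases in which B' holds
    mll_im = (sum(_xlogx(v) for v in (h, k, l, t, u, total - u))
              - 2 * _xlogx(total))
    # 7 free parameters for DM; the IM penalty follows the formula (2 x 4).
    return -2 * mll_dm + 2 * 7, -2 * mll_im + 2 * 4
```

When $B'$ has no effect at all (rows 1 and 2 identical, rows 3 and 4 identical), both models fit equally well, so IM wins by the difference in penalty terms and the extended rule is eliminated.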



After obtaining AIC(DM) and AIC(IM), rule elimination is executed in the same
way as in the Type I filter.

4 Bayesian Network Generation

A set of association rules, generated and validated as described in Section 3,
should be converted to Bayesian networks.

As mentioned before, a Bayesian network does not allow a directed cyclic
structure. Therefore we have to break directed cycles in the graph representation
of the generated rules.

Figure 4 (a) is a simple example of the graph representation of rules. There
are seven rules in the graph. In Figure 4 (a), we can find directed cycles, such as
$x_1 \Rightarrow x_3 \Rightarrow x_1$, $x_1 \Rightarrow x_3 \Rightarrow x_6 \Rightarrow x_1$,
and so on. Note that large directed cycles can be made from small basic directed
cycles. The directed cycles which we have to break are such basic directed cycles.
Figure 4 (b) shows the modified structure with dummy nodes added for belief
propagation. We may find directed cycles in this structure.

In order to break basic directed cycles of this example, we must remove the 
edges in one of the following sets. 

- $\{e_1\}$

- $\{e_3, e_7\}$

- $\{e_6, e_7\}$
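Candidate edge sets of this kind can be enumerated by brute force for small rule graphs. The sketch below is illustrative only: the seven-edge graph in the usage note is a hypothetical stand-in, since Figure 4 itself is not reproduced here, and the function names are assumptions.

```python
from itertools import combinations


def is_acyclic(edges):
    """Kahn's algorithm on a list of (src, dst) pairs."""
    nodes = {v for edge in edges for v in edge}
    indegree = {v: 0 for v in nodes}
    for _, dst in edges:
        indegree[dst] += 1
    queue = [v for v in nodes if indegree[v] == 0]
    visited = 0
    while queue:
        v = queue.pop()
        visited += 1
        for src, dst in edges:
            if src == v:
                indegree[dst] -= 1
                if indegree[dst] == 0:
                    queue.append(dst)
    return visited == len(nodes)


def minimal_cycle_breaking_sets(graph):
    """graph maps an edge name to (src, dst).

    Returns the minimal sets of edge names whose removal leaves no
    directed cycle. Exponential in the number of edges.
    """
    names = sorted(graph)
    found = []
    for size in range(len(names) + 1):
        for removed in combinations(names, size):
            removed = frozenset(removed)
            if any(smaller <= removed for smaller in found):
                continue  # contains an already-found solution, not minimal
            kept = [graph[name] for name in names if name not in removed]
            if is_acyclic(kept):
                found.append(removed)
    return found
```

With a hypothetical graph in which `e1` is $x_1 \Rightarrow x_3$, `e7` is $x_3 \Rightarrow x_1$, `e3` is $x_3 \Rightarrow x_6$ and `e6` is $x_6 \Rightarrow x_1$, plus three edges that lie on no cycle, the function returns exactly three sets: one with `e1` alone, one with `e3` and `e7`, and one with `e6` and `e7`.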

Removing edges means losing information, so we have to choose the least
useful edges. We suggest that the edge cost be set to the AIC value of
the dependent model in Section 3.2. By using this cost, we can find the maximum
a posteriori (MAP) structure among all structures which have no directed cycles.
Figure 4 (c) shows an example in which the edge costs have been obtained.
Figure 4 (d) shows a MAP structure.

Affirmative and negative expressions of the same attribute, such as $x_j$ and
$\neg x_j$, may both appear in the generated rules. Such a situation can be
resolved by adding a dummy node whose edges are directed to both the affirmative
and negative nodes.



5 Proposal towards Causal Law Mining 

Figure 5 shows the process of our method to obtain causal laws from a database. 

First, an initialized truth assignment table is constructed. In this
initialization stage, each undetermined value in the observed data is assigned
0.5. This assignment maximizes the entropy of the table.
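A small sketch makes the entropy claim concrete. It assumes that undetermined values are marked `None` and that the entropy of the table is the sum of the binary entropies of its entries; the paper's exact entropy definition is not given in this section, so this is an assumption.

```python
import math


def init_truth_table(observed):
    """Replace undetermined entries (None) by 0.5, keep observed 0/1 values."""
    return [[0.5 if v is None else float(v) for v in row] for row in observed]


def table_entropy(table):
    """Sum of binary entropies (in bits) over all entries of the table."""
    def binary_entropy(p):
        if p in (0.0, 1.0):
            return 0.0
        return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    return sum(binary_entropy(p) for row in table for p in row)
```

Each entry set to 0.5 contributes one full bit, which is the largest value the binary entropy can take, so any other initial guess for an undetermined entry gives a lower total.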

The second stage in the process is to generate association rules from the truth
assignment table, as described in Section 3. The output is a set of association
rules. To avoid over-generation of rules, filters based on AIC are used.

The third stage is to build Bayesian networks from the set of association rules,
as described in Section 4. The most important task in this stage is to eliminate
less-useful rules so as to break directed cycles in the graph representation of
the rule set.

Fig. 4. (a) graph representation of filtered rules, (b) network structure after
adding dummy nodes, (c) cost of each rule, (d) optimal network structure.



The next stage is to infer uncertain attribute values for each of the observed
cases, and update the truth assignment table. This task is done by the belief
propagation technique.

The last stage is to evaluate the entropy of the truth assignment table. The
mining process finishes if we cannot get a better table in the current cycle
than in the previous cycle.
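The overall cycle can be summarized as a loop skeleton. In this sketch the four stages are passed in as callables, all of which are hypothetical placeholders, and a "better" table is assumed to mean one with lower entropy, which the text does not state explicitly.

```python
def mine_causal_laws(table, generate_rules, build_network, propagate,
                     entropy, max_cycles=100):
    """Repeat rule generation, network construction and belief propagation
    until the truth assignment table stops improving.

    All callables are supplied by the caller; `entropy` scores a table and
    lower scores are assumed to be better.
    """
    best_table, best_entropy = table, entropy(table)
    network = None
    for _ in range(max_cycles):
        rules = generate_rules(best_table)           # stage 2
        candidate = build_network(rules)             # stage 3
        new_table = propagate(candidate, best_table)  # stage 4
        new_entropy = entropy(new_table)             # stage 5
        if new_entropy >= best_entropy:
            break  # no better table in this cycle
        best_table, best_entropy, network = new_table, new_entropy, candidate
    return best_table, network
```

The `max_cycles` bound is a safety measure added for the sketch and is not part of the described method.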



6 Related Works 

To construct a Bayesian network from data, several scoring functions have been
proposed. The most frequently used scores are the BDe score and the MDL score.
The BDe score is the posterior probability of a network structure. The MDL score
is based on the MDL principle introduced by Rissanen. Once the appropriate score
is defined, the learning task is to find the network that maximizes the score.
But this task is intractable: finding the network that maximizes the BDe score
is NP-hard. Similar arguments also apply to learning with the MDL score. Thus,
Heckerman et al. used a greedy hill-climbing technique. In such greedy methods,
the search starts at some network (e.g., the empty network) and repeatedly
applies local changes such as adding and removing edges.

The method proposed here takes a different approach in the sense that the
initial local structures are decided by the generation and validation of
association rules.

Fig. 5. Overview of Causal Law Mining



7 Conclusion 

We achieved the following in our investigation of causal law mining from an 
incomplete database. 

- We extended the definition of association rules in order to deal with uncertain
  attribute values in rule generation algorithms without loss of efficiency.

- We also showed new methods to filter out irrelevant rules by comparing the
  AIC of a dependent model and an independent model.

- We showed a new method to generate Bayesian networks from a set of association
  rules. The method removes less-useful rules which make directed cycles.
  The resultant rules construct a MAP structure.

- A scheme of Causal Law Mining was proposed. The mining process is an
  integration of the techniques described in this paper.

Abstract. We present CAMLET, a platform for the automatic composition of
inductive applications using ontologies that specify inductive learning methods.
CAMLET constructs inductive applications using process and object ontologies.
After instantiating, compiling and executing the basic design specification,
CAMLET refines the specification based on the following refinement strategies:
crossover of control structures, random generation and process replacement by
heuristics. Using fourteen different data sets from the UCI repository of ML
databases and the database on meningoencephalitis with a human expert’s
evaluation, experimental results have shown that CAMLET supports a user in
constructing inductive applications with better competence.



1 Introduction 

During the last twenty years, many inductive learning systems, such as ID3,
Classifier Systems and data mining systems, have been developed, exploiting
many inductive learning algorithms. As a result, end-users of inductive
applications are faced with a major problem: model selection, i.e., selecting
the best model for a given data set. Conventionally, this problem is resolved
by trial and error or by heuristics such as a selection table for ML algorithms.
This solution sometimes takes much time. So automatic and systematic guidance
for constructing inductive applications is really required.

Against this background, it is time to decompose inductive learning algorithms
and organize inductive learning methods (ILMs) for reconstructing inductive
learning systems. Given such ILMs, we may construct a new inductive learning
system that works well on a given data set by re-interconnecting ILMs. The
issue is to meta-learn an inductive application that works well on a given data
set. Thus this paper focuses on specifying ILMs in an ontology for learning
processes (called a process ontology here) and also an object ontology for
objects manipulated by learning processes. After constructing these two
ontologies, we design a computer aided machine (inductive) learning environment
called CAMLET and evaluate the competence of CAMLET using several case studies
from the UCI Machine Learning Repository and the database on
meningoencephalitis with a human expert’s evaluation.



2 Ontologies for Inductive Learning 

Considerable time and effort have been devoted to analyzing the following
popular inductive learning systems: Version Space, AQ15, ID3, C4.5, Classifier
Systems, Back Propagation Neural Networks, Bagged C4.5 and Boosted C4.5. The
analysis first produced just unstructured documents articulating which inductive
learning processes are in the above popular inductive learning systems.
Sometimes it was hard to decide a proper grain size for inductive learning
processes. In this analysis, we required that the inputs and outputs of
inductive learning processes be data sets or rule sets. Processes whose input
or output is just a single datum or rule were too fine to be processes. An
ontology is an explicit specification of a conceptualization. In this paper, a
process ontology is an explicit specification of a conceptualization about
inductive learning processes and an object ontology is about objects
manipulated by them. In structuring many inductive learning processes into a
process ontology, we got the following sub-groups in which similar inductive
learning processes come together at the above-mentioned grain size: “generating
training and validation sets”, “generating a rule set”, “estimate data and rule
sets”, “modifying a training data set” and “modifying a rule set”, with the
top-level control structure as shown in Figure 1.



[Figure: nodes “start”, “generating training/validation sets”, “generating a
rule set”, “estimate data/rule sets”, “modifying training set”, “modifying rule
set” and “end”]

Fig. 1. Top-level Control Structure



Although we can place finer components on the upper part, they seem to
produce many redundant compositions of inductive learning systems. Thus these
five processes have been placed on the upper part of the process library, as
shown in Figure 2. On the other hand, articulated object concepts are the data
structures manipulated by learning processes. In the rest of this section, we
specify the conceptual hierarchies and conceptual schemes (definitions) of the
two ontologies.



Design and Evaluation of an Environment to Automate 






2.1 Process Ontology 

In order to specify the conceptual hierarchy of a process ontology, it is
important to identify how to branch down processes. Because the upper part is
related to general processes and the lower part to specific processes, it is
necessary to set up different ways to branch the hierarchy down, depending on
the levels of the hierarchy.

Fig. 2. Hierarchy of Process Ontology



In specifying the lower part of the hierarchy, the above abstract components
have been divided using characteristics specific to each. For example,
“generating a rule set” has been divided into “(generating a rule set) dependent
on training sets” and “(generating a rule set) independent of training sets”
from the point of view of the dependency on training sets. Thus we have
constructed the conceptual hierarchy of the process ontology, as shown in
Figure 2. In Figure 2, leaf nodes correspond to the library of executable
program codes that have been written in C, where “a void validation set”
denotes that the learning set is not divided into training/validation sets and
that a learning system uses the training set instead of a validation set when
it estimates a rule set at the learning stage, and “window strategy” denotes
that it refines a training set using an extra validation set which is out of
character with existing rules.

On the other hand, in order to specify the conceptual scheme of the process
ontology, we have identified the learning process scheme including the following
roles: “input”, “output” and “reference” from the point of view of objects
manipulated by the process, and then “pre-process” just before the defined
process and “post-process” just after the defined process from the point of
view of processes relevant to the defined process.

2.2 Object Ontology

In order to specify the conceptual hierarchy of the object ontology, we branch
down the data structures manipulated by learning processes, such as sets and
strings, as shown in Figure 3, where “learning set” is an input data set into a
learning system, “training set” is a data set used for generating a rule set,
“validation set” is used for estimating a rule set at the learning stage and
“test set” is used for estimating a rule set at the termination of the learning
stage. Because objects contribute less to constructing inductive learning
systems than processes, the object scheme has less information than the process
scheme. So it has just one role, “process-list”, that is a list of processes
manipulating the object.

Fig. 3. Hierarchy of Object Ontology
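The two schemes can be pictured as simple records. The sketch below is one possible encoding, not CAMLET's actual data model: the class names, field types and the `can_follow` check are assumptions made for illustration, with role names taken from the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProcessScheme:
    """Roles of a learning process in the process ontology."""
    name: str
    input: List[str] = field(default_factory=list)         # objects consumed
    output: List[str] = field(default_factory=list)        # objects produced
    reference: List[str] = field(default_factory=list)     # objects consulted
    pre_process: List[str] = field(default_factory=list)   # may run just before
    post_process: List[str] = field(default_factory=list)  # may run just after


@dataclass
class ObjectScheme:
    """Objects carry only the list of processes that manipulate them."""
    name: str
    process_list: List[str] = field(default_factory=list)


def can_follow(first, second):
    """Assumed interconnection check: both schemes must agree on the order."""
    return (second.name in first.post_process
            and first.name in second.pre_process)
```

A check like `can_follow` is one way to read "checking the interconnection from the roles of pre-process and post-process"; the actual rule used by CAMLET may differ.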



3 Basic Design of CAMLET 

Figure 4 shows the basic activities for knowledge systems construction using
problem solving methods (PSMs). In this section, we apply the basic activities
to constructing inductive applications using process and object ontologies.

The construction activity constructs an initial specification for an inductive
application. CAMLET selects a top-level control structure for an inductive
learning system by selecting any path from “start” to “end” in Figure 1.
Afterwards CAMLET retrieves the leaf-level processes subsumed in the selected
top-level processes, checking the interconnection from the roles of pre-process
and post-process of the selected leaf-level processes. Thus CAMLET constructs an
initial specification for an inductive application, described by leaf-level
processes in the process ontology.

The instantiation activity fills in the input and output roles of leaf-level
processes from the initial specification, using data types from a given data
set. The values of other roles, such as reference, pre-process and post-process,
are not instantiated but come directly from process schemes. Thus an
instantiated specification is obtained. Additionally, the leaf-level processes
are filled into the process-list roles of the objects identified by the data
types.

Fig. 4. Basic Activities

The compilation activity transforms the instantiated specification into
executable code using a library for ILMs. When a process is connected to another
process at the level of implementation details, the specifications of their I/O
data types must be unified. To do so, this activity has a data conversion
facility, such as one that converts a decision tree into a classifier.

The test activity tests whether the executable code for the instantiated
specification performs well, checking it against the requirement (accuracy) from
the user. The estimation serves to carry out the refinement activity
efficiently, as explained later. This activity estimates the specification on
the top-level control structure in Figure 1 and four sub-control structures in
Figure 5. These sub-control structures have the following meanings. 1: generate
a rule set simply. 2: refine a rule set. 3: refine a training set. 4: aggregate
rule sets that are learned from training sets having different combinations of
data. Moreover, it estimates the compactness of the knowledge representations
for the heuristic strategy mentioned later.

When the executable code does not perform well, the refinement activity is
invoked in order to refine or reconstruct the initial specification and pass a
refined specification back to the instantiation activity. The refinement
activity is a kind of search task for finding a system (or control structure)
that satisfies a goal of accuracy. Although several search algorithms have been
proposed, genetic programming (GP) is popular for composing programs
automatically. GP performs well for global search but not so well for local
search. So, in order to solve this problem, we present a hybrid search that
combines GP with a local search using several heuristics based on empirical
analysis. This activity is carried out with the following three strategies:
crossover of control structures, random generation and replacement of system
components.

Fig. 5. Sub-Control Structures



Crossover of control structures makes a new specification from two parent
specifications. This operation works like G-crossover, which is a kind of
genetic operation in GP. Because evaluation values are attached to sub-control
structures in the test activity, CAMLET can identify the sub-control structures
with better evaluation values in the parent specifications and then put them
into one new child specification, as shown in Figure 6.

Fig. 6. Crossover of Control Structures
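The idea of picking the better-evaluated sub-control structure from each parent can be sketched as follows. The representation is hypothetical: each specification is assumed to be a dictionary from a sub-control-structure slot to a pair of (process list, evaluation value), with higher values meaning better.

```python
def crossover(parent_a, parent_b):
    """Build a child by taking, for each slot, the better-evaluated
    sub-control structure found in either parent.

    Each parent maps a slot name to (processes, evaluation_value).
    """
    child = {}
    for slot in sorted(set(parent_a) | set(parent_b)):
        candidates = [p[slot] for p in (parent_a, parent_b) if slot in p]
        # On a tie, max() keeps the first candidate, i.e. parent_a's.
        child[slot] = max(candidates, key=lambda c: c[1])
    return child
```

For example, if one parent has the better "generate a rule set" part and the other the better "refine a rule set" part, the child combines both.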



Random generation makes a specification with a new top-level control structure
in the same way as the construction activity. This activity is done in order to
keep various control structures in the population.

Replacement of system components changes sub-control structures using several
heuristics, which replace one process in a sub-control structure with another
process from the process ontology based on estimation results. For example, IF
there is a great disparity among the accuracies on each class, THEN replace the
current process of generating training/validation sets with another one.
Replacement of system components is a kind of local search.




[Figure: inputs “goal of accuracy” and “data set”; output “machine learning
system (rule base)”]

Fig. 7. An Overview of CAMLET



Figure 7 summarizes the above-mentioned activities. A user gives a learning
set and a goal of accuracy to CAMLET. CAMLET constructs the specification
for an inductive application, using process and object ontologies. When the
specification does not perform well, it is refined into another one with better
performance by crossover of control structures, random generation and
replacement of system components. To be more specific, when a system’s
performance is higher than $\delta$ (= 0.7 × goal accuracy), CAMLET executes the
replacement of system components. Otherwise, when the system population size is
equal to or larger than some threshold ($N \geq \tau = 4$), CAMLET executes
crossover of control structures; if not, it executes random generation. All the
systems refined by the three strategies are put into the system population. As a
result, CAMLET may (or may not) generate an inductive application that satisfies
the user’s target accuracy. When it performs well, the inductive application can
learn a set of rules that work well on the given learning set.
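The choice among the three refinement strategies amounts to a small decision rule. The sketch below uses the thresholds stated above; the function name and the returned strings are illustrative.

```python
def choose_refinement(performance, goal_accuracy, population_size,
                      ratio=0.7, tau=4):
    """Pick a refinement strategy for a system that missed the goal.

    delta = ratio * goal_accuracy; tau is the population-size threshold.
    """
    delta = ratio * goal_accuracy
    if performance > delta:
        # Close to the goal: local search.
        return "replacement of system components"
    if population_size >= tau:
        # Enough parents available: recombine them.
        return "crossover of control structures"
    # Otherwise widen the population first.
    return "random generation"
```

With a goal accuracy of 0.9, $\delta$ is 0.63, so a system scoring 0.8 is refined locally, while one scoring 0.5 triggers crossover once at least four systems are in the population.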

4 Case Studies Using UCI ML Repository 

We have implemented CAMLET in Java, including the implementations of
eighteen components in the process ontology in C. Figure 8 shows a typical
screen of CAMLET. We conducted case studies of constructing inductive
applications for fourteen different data sets from the UCI Machine Learning
Repository. Five complete 5-fold cross-validations were carried out with each
data set.

Fig. 8. CAMLET Browser
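The evaluation protocol, five complete 5-fold cross-validations, can be sketched with index lists. This is a generic illustration rather than the code used in the case studies; the reshuffling per repetition and the `seed` parameter are assumptions.

```python
import random


def repeated_kfold(n_items, k=5, repeats=5, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV.

    Each repetition reshuffles the indices and splits them into k folds.
    """
    rng = random.Random(seed)
    for r in range(repeats):
        indices = list(range(n_items))
        rng.shuffle(indices)
        folds = [indices[i::k] for i in range(k)]
        for f, test in enumerate(folds):
            train = [i for j, fold in enumerate(folds) if j != f for i in fold]
            yield r, f, train, test
```

Averaging a learner's error rate over the 25 resulting train/test splits gives one mean error rate per data set.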



Table 1. Comparison of CAMLET and Popular Inductive Learning Systems

The results of these trials appear in Table 1. For each data set, the second
to fifth columns show mean error rates over the five cross-validations of
popular inductive learning systems, such as C4.5, ID3, Classifier Systems and
Bagged C4.5. The final column shows the results from inductive applications
constructed by CAMLET, including the resulting error rates, the given goal
accuracies (error rates) and the specifications of inductive applications
constructed by CAMLET. Here, goal accuracies are set on the basis of the lowest
error rates among five popular inductive learning systems. Table 1 shows that
CAMLET constructs inductive applications with the best performance.




Fig. 9. Specification of A New System

Looking at the inductive applications constructed by CAMLET in Table 1, we see
that the error rates of the inductive applications constructed by CAMLET are
mostly better than those of popular inductive learning systems. Table 1 also
shows the rough specifications of all the inductive applications constructed by
CAMLET. They include two specifications of new inductive applications, which
are different from the eight popular inductive learning systems that we have
analyzed. New(1) consists of void validation set, entropy + information ratio,
apportionment of credit, and GA. This application makes it possible to replace
a coarse sub-tree with a fine sub-tree using GA, as shown in Figure 9. On the
other hand, New(2) consists of bootstrap, star algorithm, apportionment of
credit, and window strategy. Because the star algorithm process generates just
rules covering positive instances and not covering any negative instances, it
tends to over-fit a given training data set. In order to reduce the
over-fitting, the training data set is passed through a bootstrap process and is
polished through the window strategy process. In these case studies, these
specifications have been generated four times, for the following four data sets:
credit-s, vote, labor and zoo.

Both the combination of decision tree and GA and the combination of star
algorithm, bootstrap and window strategy seem to be promising patterns for
designing inductive applications. We could collect more design patterns for
constructing inductive applications through more case studies. Thus it is fair
to say that CAMLET works well as a platform for constructing inductive
applications.

5 A Case Study Using Medical Database with 
Evaluations from a Domain Expert 

Dr. Shusaku Tsumoto has collected the data of patients who suffered from
meningitis and were admitted to the departments of emergency and neurology in
several hospitals. He worked as a domain expert for these hospitals and
collected those data from the past patient records (1979 to 1989) and the cases
in which he made a diagnosis (1990 to 1993). The database consists of 121 cases
and all the data are described by 38 attributes, including present and past
history, laboratory examinations, final diagnosis, therapy, clinical courses and
final status after the therapy. Important issues for analyzing this data are: to
find factors important for diagnosis (DIAG and DIAG2), ones for detection of
bacteria or virus (CULT_FIND and CULTURE) and ones for predicting prognosis
(C_COURSE and COURSE).



Table 2. Evaluation from the Domain Expert

CAMLET has been applied to the database on meningoencephalitis. Table
2 shows the rough specifications (second column) and the accuracy (third
column) of inductive applications generated by CAMLET, and then the number
of rules (fourth column) and the number of rules that the domain expert takes as
good ones (fifth column) generated by the inductive applications. The medical
expert says that the following rules are effective:

[LOC < 2] $\rightarrow$ [Prognosis = good]   (1)

which shows that if a patient with loss of consciousness (LOC) came to the
hospital within two days after LOC was observed, then his/her prognosis is good.

[COLD < 9.0] $\wedge$ [Headache < 21.0] $\wedge$ [LOC > 4.0] $\wedge$
[Lasegue = 0] $\wedge$ [CSF.Pro > 110.0] $\wedge$ [CSF.Cell7 > 6.0]
$\rightarrow$ [Prognosis = notgood]   (2)

where the negative Lasegue sign is unexpected by medical experts.

Thus some rules generated by the inductive applications constructed by
CAMLET are both unexpected and appreciated by the medical expert.

6 Related Work 

Although the basic activities based on PSMs have usually been carried out
manually in constructing knowledge systems, CAMLET tries to automate the basic
activities at the level of specifications (not yet at the level of program
source code).

Besides PSMs, there are several other ontologies specific to processes, such
as Gruninger’s enterprise ontologies specific to business processes and PIE and
software process ontologies. Although our learning process ontology has some
similarities to PSMs and other process ontologies in how processes are
decomposed, it decomposes processes using more information specific to the
task domain (learning in our case).

From the field of inductive learning, CAMLET has some similarities to MSL.
MSL tries to put two or more machine learning systems together into a unified
machine learning system with better competence. MSL does not decompose machine
learning systems (although machine learning systems are sometimes adapted for
interconnection). So the grain size of the components in CAMLET is much finer
than that of the components in MSL. Furthermore, unlike CAMLET, MSL has no
competence to invent a new machine learning system.

MLC++ is a platform for constructing inductive learning systems. However,
unlike CAMLET, MLC++ has no facility for automatic composition of inductive
learning systems. MEDIA-model is a reference structure for the application of
inductive learning techniques. It focuses on methodological aspects of
inductive applications but not on automatic composition facilities for
inductive applications.

7 Conclusions and Future Work 

We have recently put effort into specifications and code for the ontologies,
and less into efficient search mechanisms to generate inductive applications
with the best performance. We need to make the refinement activity more
intelligent and efficient with failure analysis and parallel processing. The
refinement activity should be able to invent new learning processes and new
objects manipulated by learning processes.

Among ML techniques, the following are popular: bagging, boosting, Bayesian
networks and Q-learning. CAMLET takes some account of bagging and boosting, but
leaves Bayesian networks and Q-learning out of account. So, we have to extend
the process and object ontologies by adding these ML techniques.

Abstract. We discuss the significance of designing views on data in a
computational system assisting scientists in the process of discovery. A
view on data is considered as a particular way to interpret the data. In the
scientific literature, devising a new view capturing the essence of data is
a key to discovery. A system HypothesisCreator, which we have been
developing to assist scientists in the process of discovery, supports users in
designing views on data and has the function of searching for good
views on the data. In this paper we report a series of computational
experiments on scientific data with HypothesisCreator and analyses
of the produced hypotheses, some of which select several views good for
explaining given data, searched and selected from over ten million
designed views. Through these experiments we have become convinced that a view
is one of the important factors in the discovery process, and that discovery
systems should have the ability to design and select views on data in
a systematic way so that experts on the data can employ their knowledge
and thoughts efficiently for their purposes.



1 Introduction 

For developers of discovery systems, a key to the success of the scientific
discovery process is how to realize human intervention in such systems. For
example, Lee et al. reported on a successful collaboration among computer
scientists, toxicologists, chemists, and a statistician, in which they
iteratively used an inductive learning program to assist scientists in testing
their intuitions and assumptions, and attained predictive rule sets to identify
rodent carcinogenicity and non-carcinogenicity of chemicals. Their investigation
was explicitly supported by experts’ intervention in the following way: Rule
sets learned in each experiment were evaluated by collaborating experts, and the
outcome of evaluations was then used to either revise previously used semantic
assumptions and syntactic search biases, or create new constraints in subsequent
experiments.

Indeed, Langley strongly recommended that discovery systems provide
more explicit support for human intervention in the discovery process. We think
that such a recommendation is appropriate because the discovery process in
scientific domains can be regarded as a complex human activity of searching for
new scientific knowledge. To provide more support for scientists in the
discovery process, we should consider how to realize human intervention in
computational discovery systems.

As a way of introducing human intervention in discovery systems, we have
focused our attention on the concept of views on data. Informally, a viewscope
$V$ is defined as a pair of a polynomial-time algorithm $\psi$ for interpreting
data and a set $P$ of parameters of the algorithm. We call $\psi$ and $P$ the
interpreter and the viewpoint set of $V$, respectively. A view on data is a
viewscope $V = (\psi, P)$ with a fixed viewpoint, i.e., $|P| = 1$. A viewscope
on data is considered as a particular way to interpret the data. A viewscope
designed to reflect a creative thought, such as an intuition, an assumption, or
knowledge of the data, provides a method of verifying the validity of such a
thought on the data. In the scientific literature, devising a new view or
viewscope capturing the essence of data is a key to discovery. If a discovery
system allows scientists to design viewscopes on data, they can carry out the
process of discovery more smoothly.
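The informal definition maps naturally onto a small data structure. The following sketch is an illustration under assumed names (`Viewscope`, `interpret`, `contains`); the substring interpreter is just one simple example of a pattern-matching viewscope and is not taken from the system itself.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence


@dataclass(frozen=True)
class Viewscope:
    """A pair (psi, P): an interpreter and its set of viewpoints."""
    interpreter: Callable[[Any, Any], Any]  # psi(viewpoint, datum) -> value
    viewpoints: Sequence[Any]               # P

    def is_view(self):
        """A view is a viewscope with a single fixed viewpoint, |P| = 1."""
        return len(self.viewpoints) == 1

    def interpret(self, data):
        """One row per datum, one column per viewpoint."""
        return [[self.interpreter(p, d) for p in self.viewpoints]
                for d in data]


def contains(pattern, text):
    """Example interpreter: does the text contain the pattern?"""
    return pattern in text
```

The table returned by `interpret` is the kind of attribute-value input that a hypothesis generator such as a decision tree learner can consume.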

We have previously proposed a multi-purpose discovery system,
GenomicHypothesisCreator, which allows users to design their own viewscopes on
data and has a function to search a space of viewscopes for a viewscope suitable
for making a good hypothesis on the data. One of the features of the system is
that the process of data interpretation is separated from the process of
hypothesis generation. This feature makes it possible to evaluate how good a
viewscope on data is for explaining the data with respect to a selected
hypothesis generator. In GenomicHypothesisCreator a user can do the following
steps: (i) collecting data from databases, (ii) designing viewscopes on the
data, (iii) selecting a hypothesis generator whose input is the resulting values
of the interpretations of viewscopes on the data, (iv) selecting a strategy of
search for good viewscopes. After the interaction with a user, for a selected
hypothesis generator $G$ and a selected search strategy $S$, the system
iterates, until the termination condition of $S$ is satisfied, generating a
hypothesis from a viewscope with $G$, and determining the next viewscope to be
examined according to $S$.
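The iteration described above can be written as a short loop. In this sketch the generator $G$ and the strategy $S$ are represented by caller-supplied functions; all parameter names are assumptions, and returning `None` from `next_viewscope` is an added convention for an exhausted search space.

```python
def search_viewscopes(data, first_viewscope, generate, next_viewscope,
                      terminated):
    """Alternate hypothesis generation (G) and viewscope selection (S).

    generate(viewscope, data)  -> hypothesis        (the generator G)
    next_viewscope(history)    -> viewscope or None (strategy S, next step)
    terminated(history)        -> bool              (strategy S, stop test)
    """
    history = []
    viewscope = first_viewscope
    while viewscope is not None and not terminated(history):
        hypothesis = generate(viewscope, data)
        history.append((viewscope, hypothesis))
        viewscope = next_viewscope(history)
    return history
```

Keeping `generate` separate from the strategy functions mirrors the separation of data interpretation from hypothesis generation noted in the text.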

We are developing a new version of the system called HypothesisCreator.
This system can deal with a viewscope of millions of viewpoints and includes
several thousand viewscopes based on pattern matching algorithms and
viewscopes on numerical data. As hypothesis generating components, ID3 for
constructing decision trees and a standard hierarchical clustering program with
various distance measures are available. The clustering program AutoClass
is also available by invoking it as an external hypothesis generator. The system
has also been improved to handle Japanese texts.

In this paper, we have picked several examples from our series of
computational experiments on data from scientific domains with
HypothesisCreator. The purpose of this paper is not to report discoveries in
such domains, but to gain insight into how the concept of "viewscope" on data
works in the process of computational discovery. In each example, we consider
how the designed viewscopes on data contribute to hypotheses explaining the
data. In some experiments, on the order of ten thousand viewscopes with on the
order of ten million viewpoints are designed, and hypotheses which characterize
the given data well are discovered with these viewscopes. These computational
experiments suggest that the concept of "viewscope" on data is one of the
important factors in the discovery process, and that a strategy for creating
and selecting viewscopes in a systematic way is a key to the computational
discovery process, one which enables domain experts to employ their knowledge
and test their intuitions and assumptions efficiently and smoothly. In the
following sections, we report the computational experiments with
HypothesisCreator that led us to this view.

This paper is organized as follows: Section 2 gives the formal definitions
needed in describing HypothesisCreator. In Section 3 we report on experiments
on genomic data. In Section 4, we consider the role of viewscopes for
clustering programs as hypothesis generators. Section 5 describes experiments
on amino acid sequences of proteins to extract knowledge on disordered regions
of proteins, which is quite a new topic in molecular biology.

2 Preliminaries 

A viewscope is defined as follows (the terminology is slightly changed from
the earlier formulation):

Definition 1. Let $\Sigma$ be a finite alphabet called a data alphabet. We call
a string in $\Sigma^*$ a $\Sigma$-sequence. Let $\Gamma$ be a finite alphabet
called a viewpoint representation alphabet. A viewscope over $\Sigma$-sequences
is a pair $V = (\psi, P)$ of an algorithm $\psi$ taking input $(x, y)$ in
$\Sigma^* \times \Gamma^*$ and a set $P \subseteq \Gamma^*$ satisfying the
following two conditions.

1. For $x \in \Sigma^*$ and $y \in \Gamma^*$, $\psi$ on $(x, y)$ outputs a value
$\psi(x, y)$ in a set W called the value set of V if $y \in P$, and "undefined"
otherwise.

2. For $x \in \Sigma^*$ and $y \in \Gamma^*$, $\psi$ on $(x, y)$ runs in
polynomial time with respect to $|x|$ and $|y|$.

An element in P is called a viewpoint of V, and $\psi$ and P are called the
interpreter and the viewpoint set of V, respectively. A viewscope
$V = (\psi, P)$ over $\Sigma$-sequences is called a view over
$\Sigma$-sequences if |P| = 1.
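
As a rough illustration of Definition 1, a viewscope can be modeled as an
interpreter function paired with a finite viewpoint set. The class and field
names below are hypothetical, sequences and viewpoints are plain Python strings,
and the polynomial-time requirement cannot be enforced in code; it is only a
convention the interpreter is assumed to respect.

```python
from dataclasses import dataclass
from typing import Any, Callable, FrozenSet

UNDEFINED = "undefined"


@dataclass(frozen=True)
class Viewscope:
    interpreter: Callable[[str, str], Any]  # the interpreter: (x, y) -> value in W
    viewpoints: FrozenSet[str]              # the viewpoint set P

    def interpret(self, x: str, y: str) -> Any:
        # Condition 1: a value in W if y is a viewpoint, "undefined" otherwise.
        return self.interpreter(x, y) if y in self.viewpoints else UNDEFINED

    def is_view(self) -> bool:
        # A view is a viewscope with exactly one viewpoint.
        return len(self.viewpoints) == 1


# Toy viewscope with Boolean value set: "does x contain the substring y?"
contains = Viewscope(lambda x, y: y in x, frozenset({"acg", "tta"}))
```

Here `contains.interpret("ttacgg", "acg")` evaluates the interpreter, while a
string outside the viewpoint set yields `UNDEFINED`.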



Example 1. We consider a viewscope $V = (\psi, P)$ of approximate string
matching on DNA sequences. A viewpoint in P is of the form
$(o, l_b, l_e, w, k)$, where o is either "startpoint" or "endpoint", $l_b$ and
$l_e$ are integers with $l_b \le l_e$, w is a string on {a, t, g, c}, and
$k \ge 0$ is an integer. Given a viewpoint $(o, l_b, l_e, w, k) \in P$ and a
$\Sigma$-sequence including information about a complete DNA sequence x on
{a, t, g, c} and the location of a gene g on x whose start and end points are
$g_s$ and $g_e$, respectively, the interpreter $\psi$ extracts the substring
$\tilde{x}$ of x in the interval $[g_s + l_b, g_s + l_e]$ if o is "startpoint"
and $[g_e + l_b, g_e + l_e]$ if o is "endpoint" (note that o specifies the
origin of the coordinate on x for g), and decides whether $\tilde{x}$ contains
a substring whose edit distance from w is at most k. In the current version of
HypothesisCreator, the patterns w of approximate string matching are
automatically extracted from the designated text regions with specific lengths.
The value set of V is Boolean. As options of the viewscopes, selecting mismatch
types and filtering strings with alphabet indexing are available. A mismatch
has three types: insertion, deletion, and substitution. Given a subset of them,
the interpreter $\psi$ executes matching according to it. The default subset is
the full set. Alphabet indexing is a mapping from an original alphabet to a
small alphabet, which is a key to the success in characterization of
transmembrane domains in the knowledge discovery system BONSAI [19]. When this
option is selected, text and pattern strings are transformed according to a
specific alphabet indexing before matching.
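
The interpreter of Example 1 might look roughly like the sketch below. It is
not the system's implementation (which uses agrep, as described in Section 3.1):
it uses a plain dynamic-programming computation of the smallest edit distance
between the pattern and any substring of the region, allows all three mismatch
types, and assumes 0-based coordinates with inclusive interval ends. The
function names are made up for the example.

```python
def min_edit_distance_to_substring(text: str, pattern: str) -> int:
    """Smallest edit distance between `pattern` and any substring of `text`."""
    # Row for the empty pattern prefix: a match may start anywhere, so all zeros.
    prev = [0] * (len(text) + 1)
    for i, pc in enumerate(pattern, start=1):
        cur = [i] + [0] * len(text)
        for j, tc in enumerate(text, start=1):
            cur[j] = min(prev[j] + 1,               # pattern character left unmatched
                         cur[j - 1] + 1,            # extra text character
                         prev[j - 1] + (pc != tc))  # match or substitution
        prev = cur
    return min(prev)


def approx_match_view(genome: str, gene_start: int, gene_end: int, viewpoint) -> bool:
    """Evaluate one viewpoint (o, l_b, l_e, w, k) on one gene of `genome`."""
    origin, lb, le, w, k = viewpoint
    base = gene_start if origin == "startpoint" else gene_end
    lo, hi = max(0, base + lb), min(len(genome), base + le + 1)
    return min_edit_distance_to_substring(genome[lo:hi], w) <= k


# Toy genome: a gene starting at index 50 with "acgagt" 30 bases upstream.
genome = "t" * 20 + "acgagt" + "t" * 24 + "atgaaa"
hit = approx_match_view(genome, 50, 55, ("startpoint", -40, -10, "acgcgt", 1))
miss = approx_match_view(genome, 50, 55, ("startpoint", -40, -10, "acgcgt", 0))
```

With one allowed mismatch the pattern "acgcgt" matches the upstream occurrence
"acgagt"; with zero allowed mismatches it does not.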

For a set $D \subseteq \Sigma^*$ and a viewscope $V = (\psi, P)$ over sequences
on $\Sigma$, we call the $|D| \times |P|$ matrix $D^V$ defined by
$D^V(x, y) = \psi(x, y)$ for $x \in D$ and $y \in P$ the data matrix of D under
the viewscope V. This becomes an input to a selected hypothesis generator.
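
A minimal sketch of building such a data matrix, assuming sequences and
viewpoints are Python strings and the interpreter is an ordinary function. The
function name and the choice to sort the viewpoints to fix a column order are
assumptions of this example.

```python
def data_matrix(data, interpreter, viewpoints):
    """Return (columns, rows): one row per sequence x, one column per viewpoint y."""
    columns = sorted(viewpoints)  # fix an arbitrary but reproducible column order
    rows = [[interpreter(x, y) for y in columns] for x in data]
    return columns, rows


# Toy example with a substring-containment interpreter.
columns, matrix = data_matrix(["acgt", "ttga"], lambda x, y: y in x, {"cg", "tt"})
```

The result has |D| rows and |P| columns, which is the shape a hypothesis
generator such as a decision tree learner would take as input.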

We have defined operations on viewscopes in our earlier work, which provide
methods to generate new viewscopes from existing viewscopes. The concatenation
operation is defined as follows: Let $V_1 = (\psi_1, P_1)$ and
$V_2 = (\psi_2, P_2)$ be viewscopes over sequences on $\Sigma$. We assume
$P_1 \cap P_2 = \emptyset$. Let $\psi^+$ be a polynomial-time algorithm such
that $\psi^+$ on $(x, y)$ simulates $\psi_i$ on $(x, y)$ if y belongs to $P_i$
for i = 1, 2. Then we define the concatenation of $V_1$ and $V_2$ as the
viewscope $V_1 + V_2 = (\psi^+, P^+)$, where $P^+ = P_1 \cup P_2$.
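
The concatenation operation can be sketched as below, representing a viewscope
as a pair (interpreter, frozenset of viewpoints). This representation and the
function name are assumptions for illustration only.

```python
def concatenate(v1, v2):
    """Concatenate two viewscopes with disjoint viewpoint sets."""
    (psi1, p1), (psi2, p2) = v1, v2
    if p1 & p2:
        raise ValueError("viewpoint sets must be disjoint")

    def psi_plus(x, y):
        # Dispatch to whichever interpreter owns the viewpoint y.
        if y in p1:
            return psi1(x, y)
        if y in p2:
            return psi2(x, y)
        return "undefined"

    return psi_plus, p1 | p2


# Toy viewscopes: substring containment, and a minimum-length test.
v_contains = (lambda x, y: y in x, frozenset({"cg"}))
v_min_len = (lambda x, y: len(x) >= int(y), frozenset({"3", "10"}))
psi_plus, p_plus = concatenate(v_contains, v_min_len)
```

Because the viewpoint sets are disjoint, the dispatch is unambiguous, and the
concatenated interpreter costs at most one membership test more than the
interpreter it simulates.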



3 Genomic Data and Experiments 

Various kinds of knowledge discovery problems in genome science provide an 
interesting opportunity for discovery science. In this section, we describe our 
computational experiments on genomic data with HypothesisCreator. For 
these experiments, we have selected, as data, the annotation files of the
complete genome sequences of S. cerevisiae, whose total size is 43 MBytes. These
files are formatted in the form of the DDBJ/EMBL/GenBank Feature Table,
each of which consists of three types of objects: the header part, the
chromosome sequence, and its annotations. S. cerevisiae is a baker's yeast,
which has 16 chromosomes whose DNA sequences are 12,057,849 base pairs in total,
and which encodes 6,565 genes including putative ones. S. cerevisiae has been
studied extensively and is used as a model organism in biology.

As related work on knowledge discovery from yeast genome sequences, Brazma
et al. reported the discovery of transcription factor binding sites from the
upstream regions of genes in the genome sequences by a method they developed to
find the highest rating patterns in given sequences, which can be thought of as
a view on sequences.

In the following two subsections, we describe computational experiments on
two groups of genes, cyclin genes and DNA replication genes in late G1 phase.
It should be noted that the following experiments can be carried out for other
organisms if their annotated complete genome sequences are provided. We are in
fact carrying out such experiments, and the results will be reported elsewhere.

3.1 Cyclin Genes 

Here we describe experiments to generate knowledge explaining cyclin genes of
S. cerevisiae, which appear to play the role of switches in cell cycles. The
somatic cell cycle is the period between two mitotic divisions. The time from
the end of one mitosis to the start of the next is called interphase, which is
divided into the G1, S, and G2 periods. A cyclin accumulates by continuous
synthesis during interphase, but is destroyed during mitosis. Its destruction is
responsible for inactivating M phase kinase and releasing the daughter cells to
leave mitosis.

Fig. 1. A hypothesis for cyclin. Views of approximate string matching with
insertion and deletion are assigned to non-terminal nodes. The view v assigned
to the root node is interpreted as follows: The pattern of v is LRRISKAD, and
the number of allowed mismatches is at most 4. For each gene g, v tests whether
the pattern approximately matches the subsequence from the 225th to the 375th
amino acid residue of the translation of g. The gene g flows into the branch
labeled "+YES->" if the pattern is matched in the subsequence with at most 4
mismatches, and flows into the branch labeled "+NO ->" otherwise. The root node
has 18 cyclin genes and 6565 non-cyclin genes. The views assigned to the
internal nodes can be interpreted in the same way.
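
The way a gene "flows" through such a tree can be sketched as follows. The
nested-dictionary tree layout and the helper names are hypothetical, and for
brevity the toy view uses exact substring containment as a stand-in for
approximate matching with up to 4 mismatches; the one-node tree is only an
illustration, not the hypothesis of Fig. 1.

```python
def classify(node, translation):
    """Follow YES/NO branches from the root until a leaf label is reached."""
    while isinstance(node, dict):
        node = node["yes"] if node["view"](translation) else node["no"]
    return node


def contains(pattern, start, end):
    # Stand-in view: exact containment within residues [start, end).
    return lambda translation: pattern in translation[start:end]


toy_tree = {"view": contains("LRRISKAD", 225, 375), "yes": "POS", "no": "NEG"}

with_motif = "M" * 230 + "LRRISKAD" + "M" * 200
without_motif = "M" * 400
```

Internal nodes are dictionaries holding a view and two subtrees, and leaves are
plain labels, so deeper trees are obtained by nesting dictionaries.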



We have a list of cyclin genes as a search result of the Yeast Proteome
Database. Using the list, HypothesisCreator separates the genes described in
the annotation files into cyclin genes and the other genes. There are 18 cyclin
genes and 6565 other genes including RNAs.

We have designed various kinds of viewscopes as follows: approximate string
matching viewscopes on DNA sequences, approximate string matching viewscopes
on amino acid sequences, and PROSITE viewscopes on amino acid sequences. As the
interpreter of the approximate string matching viewscopes, agrep, a fast
approximate matching algorithm based on bitwise operations, is currently
adopted and implemented. An approximate string matching viewscope on amino acid
sequences is defined in the same way as for DNA sequences. PROSITE is a
database of protein families and domains, which are classified with a kind of
regular expression. The viewpoint set of the PROSITE viewscope is the set of
entries of the database. As an option, PROSITE patterns can be added and
deleted. In each experiment, the viewscope passed to a hypothesis generator is
a concatenation of hundreds or thousands of the above viewscopes.
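
To illustrate how a PROSITE-style pattern can act as a viewpoint, the sketch
below translates the common pattern syntax (elements separated by "-", "x" for
any residue, "[...]" for alternatives, "{...}" for exclusions, "(n)" or "(n,m)"
for repetition, "<" and ">" for sequence ends) into a Python regular
expression. This is an assumption about one possible implementation, not a
description of how HypothesisCreator matches PROSITE entries, and it does not
handle rarer constructs such as ">" inside brackets.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into Python regular expression syntax."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("<", "^").replace(">", "$")
        element = element.replace("{", "[^").replace("}", "]")   # exclusions
        element = element.replace("(", "{").replace(")", "}")    # repetitions
        element = element.replace("x", ".")                      # any residue
        parts.append(element)
    return "".join(parts)


def prosite_view(sequence: str, pattern: str) -> bool:
    """Boolean view: does the amino acid sequence contain the pattern?"""
    return re.search(prosite_to_regex(pattern), sequence) is not None


example = prosite_to_regex("N-{P}-[ST]-{P}.")
```

The braces are rewritten before the parentheses so that the "{" produced for
repetition counts is not mistaken for an exclusion.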

As a hypothesis generator, we have selected ID3 for constructing a decision
tree whose non-terminal nodes are assigned views selected from the viewscope
given as input to the hypothesis generator. To obtain the hypothesis in
Fig. 1 we designed a viewscope with 935,230 viewpoints, which is the
concatenation of 3,305 approximate string matching viewscopes with insertion and
deletion on amino acid sequences. We repeated such an experiment over 20
times with different viewscopes, whose total number of viewpoints is 13,631,690.
The viewscopes were improved manually because we wanted to see how changes in
viewscopes influence the generated hypotheses. After the manual viewscope
search, we optimized the pattern lengths automatically with a local search
strategy. The final output is shown in Fig. 1.
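
ID3 chooses, at each node, the attribute with the largest information gain. The
sketch below shows that selection step on a Boolean data matrix whose columns
are views. It is a textbook-style illustration with hypothetical function
names, not the implementation used in HypothesisCreator.

```python
from math import log2


def entropy(pos: int, neg: int) -> float:
    total = pos + neg
    return -sum(c / total * log2(c / total) for c in (pos, neg) if c)


def information_gain(column, labels) -> float:
    """column[i]: Boolean value of one view on example i; labels[i]: True if positive."""
    def subset_entropy(rows):
        pos = sum(labels[i] for i in rows)
        return entropy(pos, len(rows) - pos) if rows else 0.0

    yes = [i for i, v in enumerate(column) if v]
    no = [i for i, v in enumerate(column) if not v]
    n = len(labels)
    before = subset_entropy(list(range(n)))
    after = len(yes) / n * subset_entropy(yes) + len(no) / n * subset_entropy(no)
    return before - after


def best_view(matrix, labels) -> int:
    """Index of the column (view) with the largest information gain."""
    gains = [information_gain([row[j] for row in matrix], labels)
             for j in range(len(matrix[0]))]
    return max(range(len(gains)), key=gains.__getitem__)


labels = [True, True, False, False]
matrix = [[True, True], [True, False], [False, True], [False, False]]
root = best_view(matrix, labels)
```

In the toy matrix, the first view separates positives from negatives perfectly
and is therefore chosen for the root.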

We can see in Fig. 1 that the 18 cyclin genes are roughly classified into 3
groups. The CLB subfamily, CLB1, CLB2, CLB3, CLB4, CLB5 and CLB6, is
completely separated from the other gene products by the view of the root node
(see the caption of Fig. 1 for the view assigned to the root node). CLB1,
CLB2, CLB3 and CLB4 are known as G2/M-phase-specific cyclins, and CLB5
and CLB6 are B-type cyclins appearing late in G1. In the Yeast Proteome
Database we can easily find that the pairwise identities and similarities of
the translations of the 6 CLB genes are relatively high. Furthermore, we
can find that the pattern LRRISKAD is one of the most common subsequences
among the 6 translations. Among the genes g not matched by the view of
the root node, the genes CLN1, CLN2, PCL1 and PCL2 satisfy the following
rules: the pattern CLILAAK is matched with at most two mismatches in the
interval [90,150] of the translation of g, and the pattern KSN is matched with
at most one mismatch in the interval [0,50] of the translation of g. It is known
that CLN1, CLN2 and PCL1 share the feature of being G1/S-specific cyclins.
We have not yet characterized the other group of 6 genes, which reach the
deepest terminal node in Fig. 1. From these experiments, we recognize that some
characterizations of data can be obtained by applying large numbers of
viewscopes to the data.



3.2 DNA Replication Genes in Late G1

The next group of genes we use consists of the genes reported as related to
DNA replication in late G1, whose gene products are the following: RNR1, DUT1,
DPB3, RFA3, RFA2, POL2, DPB2, CDC9, RFA1, PRI2, CDC45, CDC17, CDC21, POL12
and POL30. Here we do not distinguish a gene from its gene product. Among the
experiments on these genes, we report one that characterizes the genes by
their upstream regions. In this experiment, we designed 193 approximate
string matching viewscopes on DNA sequences, where the total number of
viewpoints is 89,225. Fig. 2 shows the output.

    acgcgt(0) [-200,-150] (15,6534)
    +YES->taccat(1) [-140,-90] (7,42)
    |     +YES->cactat(1) [-110,-60] (7,8)
    |     |     +YES->POS(7,0)
    |     |     |     RFA1, PRI2, CDC45, CDC17, CDC21, POL12, POL30
    |     |     +NO ->NEG(0,8)
    |     +NO ->NEG(0,34)
    +NO ->acgcg(0) [-130,-100] (8,6492)
          +YES->tgtccat(1) [-200,-150] (5,83)
          |     +YES->aacgaa(1) [-100,-50] (5,8)
          |     |     +YES->NEG(0,8)
          |     |     +NO ->POS(5,0)
          |     |           RFA3, RFA2, POL2, DPB2, CDC9
          |     +NO ->NEG(0,75)
          +NO ->tattacgc(0) [-110,-70] (3,6409)
                +YES->POS(2,0)
                |     DPB3, DUT1
                +NO ->cggtcgtaaa(1) [10,30] (1,6409)
                      +YES->POS(1,0)
                      |     RNR1
                      +NO ->NEG(0,6409)

Fig. 2. A hypothesis for DNA replication in late G1. Approximate string
matching viewscopes with insertion, deletion, and substitution are assigned to
non-terminal nodes. The viewpoint assigned to the root node is the approximate
string matching of the pattern "acgcgt" without any mismatch, whose text region
is the region from -200 to -150 upstream of each gene.

We can see that, in the created hypothesis, GC-rich patterns are a key.
7 genes (RFA1, PRI2, CDC45, CDC17, CDC21, POL12, POL30) among the
15 genes to be characterized satisfy the following rules: Let g be a gene. The
pattern "acgcgt" exists in the upstream region [-200,-150] of g (rule 1), the
pattern "taccat" matches in [-140,-90] of g with at most one mismatch (rule
2), and the pattern "cactat" matches in [-110,-60] of g with at most one
mismatch (rule 3). By these rules, the 7 genes are completely separated from the
other genes. On the other hand, 5 of the 8 genes not satisfying rule 1 have
the pattern "acgcg", which is almost the same as "acgcgt" of rule 1, in the
upstream region [-130,-100]. We can easily see that 12 of the 15 genes have
such features. TRANSFAC is a database of transcription factors. In it we can
find that the sequence "acgcgt" of the root node is a transcription factor
binding site of CDC21. In addition, the pattern "acgcg" is a substring of a
transcription factor binding site of CDC9. The binding site of this
transcription factor extends from position -160 to position -105, which
overlaps the text region [-130,-100] of the approximate string matching
viewscope. This fact is one piece of evidence for the effectiveness of
HypothesisCreator.

4 Cluster Generation with Viewscopes for cDNA 
Microarray Gene Expression Profile Data 

Clustering is a process of partitioning a set of data or objects into a set of
classes, which are called clusters. A cluster is a collection of data objects
that are similar to one another and thus can be thought of as a group of data
having some common features. Since clustering does not rely on predefined
classes and examples, it is one of the important strategies in the process of
discovery. In fact, Cheeseman et al. succeeded in making stellar taxonomies by
applying their AutoClass system to infrared data.

In the current version of HypothesisCreator, two clustering programs are
available as hypothesis generators. One is a built-in basic hierarchical cluster
generator, in which several similarity and distance functions between objects
are selectable, for example, Euclidean distance and correlation coefficient,
and in which nearest neighbor, furthest neighbor, centroid, and average are
available as functions between clusters. The other is AutoClass, invoked by
HypothesisCreator as an external tool. In HypothesisCreator users can process
input raw data by applying viewscopes according to their purposes, and transfer
the data to clustering programs. (For AutoClass, HypothesisCreator outputs
files in the form of input files of AutoClass.) In this way, clustering programs
are utilized for the process of discovery.
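
A basic hierarchical cluster generator of the kind described above could be
sketched as follows. This is a naive illustration with hypothetical names, not
the built-in generator: it supports nearest neighbor, furthest neighbor, and
average as functions between clusters (centroid is omitted) and uses Euclidean
distance by default.

```python
from itertools import combinations


def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


LINKAGE = {
    "nearest": min,                           # nearest neighbor (single linkage)
    "furthest": max,                          # furthest neighbor (complete linkage)
    "average": lambda ds: sum(ds) / len(ds),  # average over all cross-cluster pairs
}


def agglomerate(points, n_clusters, linkage="average", dist=euclidean):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    combine = LINKAGE[linkage]

    def cluster_dist(a, b):
        return combine([dist(points[i], points[j]) for i in a for j in b])

    while len(clusters) > n_clusters:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: cluster_dist(clusters[p[0]], clusters[p[1]]))
        clusters[a] += clusters[b]  # a < b, so deleting b keeps index a valid
        del clusters[b]
    return clusters


toy_clusters = agglomerate([(0, 0), (0, 1), (10, 10), (10, 11)], 2)
```

Stopping at different values of `n_clusters` corresponds to cutting the
hierarchy at different levels.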

One of our concerns in applying clustering programs in HypothesisCreator
is to classify genes from gene expression profile data produced by cDNA
microarrays and DNA chips. Indeed, biological knowledge extraction from gene
expression profiles is becoming a hot research topic in functional genomics. We
have made a preliminary experiment using the clustering programs of
HypothesisCreator. The input raw data we used in the experiment is the time
course of the expression levels of the genes of S. cerevisiae produced by cDNA
microarray. The rows for the first four genes are shown below. The total number
of genes is about 6153.



ORF      Name   R1.Ratio  R2.Ratio  R3.Ratio  R4.Ratio  R5.Ratio  R6.Ratio  R7.Ratio
YHR007C  ERG11  1.12      1.19      1.32      0.88      0.84      0.38      0.43
YBR218C  PYC2   1.18      1.23      0.77      0.75      0.79      0.71      2.7
YAL051W  FUN43  0.97      1.32      1.33      1.18      1.12      0.88      0.93
YAL053W         1.04      1.15      1.33      1.18      1.81      0.7       0.96


We have designed a viewscope on numerical data, whose viewpoints are
sequences of basic numerical functions on a numeric table, such as a logarithmic
function on each element, a function calculating the ratios of the corresponding
row's elements in two specified columns, and a function calculating the
differences of the corresponding row's elements in two specified columns. The
interpreter of the viewscope applies the functions of a viewpoint in sequence to
the current data. We have selected a similarity function from the literature
that is a form of correlation coefficient and, as a distance function between
two clusters, the function calculating the average of the similarities over all
pairs of objects drawn from the two clusters.
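
The idea of a viewpoint as a sequence of table functions can be sketched as
below. The function names are hypothetical; that the ratio and difference
functions append a new column is an assumption, since the text does not say
where their results are stored; and the plain Pearson correlation shown here is
only a stand-in for the similarity function actually selected.

```python
from math import log, sqrt


def log_all(table):
    """Apply the natural logarithm to every element."""
    return [[log(v) for v in row] for row in table]


def ratio(i, j):
    """Append, to each row, the ratio of its elements in columns i and j."""
    return lambda table: [row + [row[i] / row[j]] for row in table]


def difference(i, j):
    """Append, to each row, the difference of its elements in columns i and j."""
    return lambda table: [row + [row[i] - row[j]] for row in table]


def interpret(table, viewpoint):
    """Apply the functions of a viewpoint, in sequence, to the current data."""
    for function in viewpoint:
        table = function(table)
    return table


def correlation(a, b):
    """Pearson correlation coefficient of two equal-length rows."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))


processed = interpret([[1.0, 2.0], [4.0, 8.0]], [ratio(1, 0), log_all])
```

Each viewpoint is just a list of functions, so new preprocessing pipelines are
obtained by reordering or extending the list.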

For the time course data of gene expression, DeRisi et al. showed that
distinct temporal patterns of induction or repression help to group genes that
share regulatory properties, and gave five classes of genes, some of which are
the genes in the rows of classes B, C, D, E, and F of Table 1 (for the shared
regulatory properties of the classes, see Fig. 5 of their paper).

One of the viewpoints in this experiment is as follows: First, the columns from
R1.Ratio to R7.Ratio are selected. Second, for each row of the selected columns,
the element in the jth column is replaced with the gap between the elements in
the (j+1)th and jth columns for j = 1, ..., 6. Finally, each element is
transformed with the natural logarithmic function. Comparing DeRisi's groups of
genes with collections of clusters produced by HypothesisCreator at various
hierarchical levels and viewpoints, we found a set of thirteen clusters that
shows similarity to the five groups, as summarized in Table 1.

[Table 1: body not recoverable from the source.]

Table 1. The numbers 0, 1, 2, 3, 4, 8 represent the identifiers of our clusters.




In our clustering, the cluster with id 4 overlaps with class B and class D.
However, we can see from Fig. 5 of the paper by DeRisi et al. that class B and
class D have similar temporal patterns of induction. Through this preliminary
experiment, we can see that processing input raw data is important for
preparing the input data of clustering programs. We will continue to implement
new viewscopes and viewpoints on numerical data to deal with the huge amounts of
gene expression data to come.



5 Knowledge Extraction of Disordered Regions of 
Proteins from PDB 



The conventional view that a protein must attain an "ordered" conformation,
which is stable and folded, in order to carry out its specific biological
function no longer holds for numerous proteins found to have important
biological functions in their disordered regions. In general, disordered regions
in proteins serve as flexible linkers that allow movement between structured
domains and also as sites for binding by specific ligands. Disorder in proteins
can be identified by means of sensitivity to protease digestion, NMR
spectroscopy, and X-ray crystallography techniques. Recently, Garner et al.
made an effort to predict disordered regions with a neural network algorithm and
to classify them into 3 groups based on the lengths of the disordered regions.

Although disorder in protein structures is quite a widely occurring
phenomenon, it is not as widely recognized by the biological community due to
the lack of collective data or systematic studies of it. In this preliminary
study, we generate hypotheses by designing views using HypothesisCreator.
Our aim is to create a database of disordered and possibly disordered proteins
from PDB. The hypotheses generated by our views are used to classify these
groups of disordered and possibly disordered proteins in a systematic and
accessible manner for future data retrieval.

The data for our experiments are taken from the Protein Data Bank (PDB),
whose current 9652 entries include 8358 entries determined by X-ray diffraction,
1581 entries determined by NMR spectroscopy, and 223 structures proposed by
modeling. Candidates for our data are chosen by selecting entries which contain
the keywords "DISORDER" and "GAP" in the header text. "DISORDER" denotes a case
of disorder irrespective of its nature. A "GAP IN ENTRY" statement indicates the
highest likelihood that a protein is disordered. These entries are then
extracted using a MetaCommander script. A total of 193 entries which contain
the keyword "GAP" and 15 which have the "DISORDER" keyword are used in our
training set. 24 entries are found to contain both keywords and are used as
positive examples in our training sets. We designed 79 viewscopes of approximate
string matching, whose total number of viewpoints is 13871.

Our system provides a user-defined length between ordered and disordered
regions. Since ordered and disordered regions are taken from within the same
sequence, this eliminates biased representation due to residue complexities.
Positive examples are generated from disordered regions. We have designed
viewscopes of approximate string matching on the upstream and downstream regions
and the ordered/disordered regions of examples. The region length is specified
in the process of designing viewscopes, except for disordered regions.
HypothesisCreator generates hypotheses for the proteins that are disordered or
might be disordered. An example is given in Fig. 3. Our system looks closely
into the upstream and downstream regions that flank the disordered region.
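
Extracting the flanking regions can be sketched as below. The function name and
the 0-based, end-exclusive coordinates are assumptions, and the default window
of 20 residues reflects one reading of the length mentioned in the caption of
Fig. 3.

```python
def flanking_regions(sequence: str, start: int, end: int, window: int = 20):
    """Return the (upstream, downstream) regions around a disordered region.

    `start` and `end` are the 0-based, end-exclusive bounds of the disordered
    region; each flank is at most `window` residues long and is truncated at
    the sequence ends.
    """
    upstream = sequence[max(0, start - window):start]
    downstream = sequence[end:end + window]
    return upstream, downstream


# Toy chain: 30 ordered residues, 10 disordered, 30 ordered.
chain = "A" * 30 + "D" * 10 + "C" * 30
up, down = flanking_regions(chain, 30, 40)
```

A view such as the pattern match on the downstream region in Fig. 3 would then
be evaluated on `down` rather than on the whole chain.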



    AGNVRV(0) [downstream] (24,5098)
    +YES->POS(4,0)
    |     1HXP:A:35, 1HXP:B:29, 1HXQ:A:36, 1HXQ:B:30
    +NO ->EVKIIG(0) [downstream] (20,5098)
          +YES->POS(3,0)
          |     1BMF:G:69, 1COW:G:69, 1EFR:G:69
          +NO ->ASEMID(0) [downstream] (17,5098)
                +YES->POS(3,0)
                |     1BMF:G:115, 1COW:G:115, 1EFR:G:115
                +NO ->CVFWNQ(0) [downstream] (14,5098)
                      +YES->POS(3,0)
                      |     1AX9::505, 2ACE::505, 2ACK::505
                      +NO ->(11,5098)

Fig. 3. 1HXP:A:35 corresponds to PDBid:chain:the starting residue point of a
disordered region (no chain symbol means that the chain is unique).
Downstream/Upstream is an ordered region found after/before the disordered
region of length 20. A variable window size can be selected to represent the
downstream/upstream region.




The root node is assigned the view of approximate string matching with the
pattern AGNVRV, which completely separates 4 positive examples, 1HXP:A:35,
1HXP:B:29, 1HXQ:A:36 and 1HXQ:B:30, from the other examples. 1HXP and 1HXQ are
nucleotidyl transferases which have 2 subunit chains, A and B. The starting
positions of the disordered regions for both protein chains are given by the
numbers following their chains. These positions correspond to the beginning of
the GAP regions in the respective proteins (the complete lengths of the
disordered regions are given in the cited reference). The created hypothesis
proves the validity of our classifications, as shown by proteins of the same
class being output together. Similarly, string matching with the pattern EVKIIG
produces 3 positive examples on the path NO, YES (1BMF:G:69, 1COW:G:69,
1EFR:G:69). 1BMF:G:115, 1COW:G:115 and 1EFR:G:115 are generated on the path
NO, NO, YES. All these proteins are ATP phosphorylases. Matching with these two
different patterns produces similar results that detect disorder at different
locations. This implies that HypothesisCreator is able to differentiate the
different sites of disordered regions within the same proteins based on the
patterns used.

In this study, disorder is classified into separate groups based on views
of approximate string matching. Future work to assign stringent criteria for the
selection and classification of the disordered regions is in progress in our
lab. Our ultimate aim is to create a rule classifying ordered and disordered
regions.



6 Conclusion 

We have seen some computational experiments with HypothesisCreator on
real data for the problems we are facing in our scientific domain. These
problems are real and fresh. Through the experiments conducted with
HypothesisCreator, we have realized the importance of viewscope design in
computational discovery systems. A key to discovery systems is a viewscope that
gives an interpretation of raw data. This is consistent with the fact that most
of the discovery systems in the literature which have shown successful results
installed good viewscopes or views for their targets. We consider that discovery
systems should allow domain experts, in some way, to design their own viewscopes
easily and systematically, or automatically.

Timeliness is also a key to computational discovery systems, since fresh
data should be quickly analyzed by experts. Domain scientists are eager to
accomplish their data analysis as quickly as possible. By installing new
viewscopes in HypothesisCreator, or by having human experts interactively
search a space of viewscopes according to the computational results produced by
the system, HypothesisCreator is expected to cope quickly with new
data from new domains. Apart from our scientific domain, a challenge to
characterize traditional Japanese poetry, waka, is also being made by modifying
viewscopes of HypothesisCreator to deal with data in Japanese codes.

Although the current version of HypothesisCreator does not yet have enough
ability to fully meet these requirements, HypothesisCreator is still
being developed for further extension and improvement in order to attain higher
achievements.


Abstract. Waka is a form of traditional Japanese poetry with a 1300-year
history. In this paper we attempt to semi-automatically discover
instances of poetic allusion, or more generally, to find similar poems in 
anthologies of waka poems. The key to success is how to define the 
similarity measure on poems. We first examine the existing similarity 
measures on strings, and then give a unifying framework that captures 
the essences of the measures. This framework makes it easy to design new 
measures appropriate to finding similar poems. Using the measures, we 
report successful results in finding poetic allusion between two antholo- 
gies Kokinshu and Shinkokinshu. Most interestingly, we have found an 
instance of poetic allusion that has never been pointed out in the long
history of waka research.



1 Introduction 

Waka is a form of traditional Japanese poetry with a 1300-year history. A waka
poem is in the form of tanka, namely, it has five lines and thirty-one
syllables, arranged thus: 5-7-5-7-7.¹ Since one syllable is represented by one
kana character in Japanese, a waka poem consists of thirty-one kana characters.

Waka poetry has been central to the history of Japanese literature, and has 
been studied extensively by many scholars. Most interestingly, Fujiwara no 
Teika (1162-1241), one of the greatest waka poets, is also known as a great 
scholar who established a theory about rhetorical devices in waka poetry. 

One important device is honkadori (allusive-variation), a technique based
on specific allusion to earlier famous poems, subtly changing a few words to
relate it to the new circumstances. It was much admired when skilfully handled.
This device was first consciously used as a sophisticated technique by Teika's
father Fujiwara no Shunzei (1114-1204) and then established both theoretically
and practically by Teika himself, although its use had begun in earlier times.
Figure 1 shows an example of poetic allusion.

¹ The term waka originally meant Japanese poetry as opposed to Chinese poetry,
but it is frequently used as a synonym of tanka (short poem), which is the
clearly dominant form of Japanese poetry, although nagauta (long poem) and
sedoka (head-repeated poem) are included in the term waka as well.

Poem alluded to. (Kokinshu #147) Anonymous. Topic unknown.

    Ho-to-to-ki-su            Oh, Cuckoo, you sing
    Na-ka-na-ku-sa-to-no      Now here, now there, all about.
    A-ma-ta-a-re-ha           In a hundred villages.
    Na-ho-u-to-ma-re-nu       So I feel you are estranged to me.
    O-mo-fu-mo-no-ka-ra.      Though I think you are dear to me.

Allusive-variation. (Shinkokinshu #216) Saionji Kintsune.
At the 1500 game poetry contest.

    Ho-to-to-ki-su            Oh, Cuckoo, you must be singing
    Na-ho-u-to-ma-re-nu       In some other villages now.
    Ko-ko-ro-ka-na            But in this twilight,
    Na-ka-na-ku-sa-to-no      I cannot feel you are estranged to me.
    Yo-so-no-yu-fu-ku-re.     Though you are not here with me.

Fig. 1. An example of poetic allusion. The hyphens are inserted between
syllables, each of which was written as one kana character although romanized
here. English translation due to Dr. Kei Nijibayashi, Kyushu Institute of
Technology.

For interpretations of poems utilizing this device, one must know what poems 
they allude to. Although the poems alluded to might be obvious and well-known 
at the time of writing, they are not so for present-day researchers. The task of 
finding instances of poetic allusion has been carried out, up till now, exclusively 
by human efforts. Although the size of each anthology is not so large (a few 
thousand poems at most), the number of combinations between two anthologies 
is on the order of millions. 

Recently, an accumulation of about 450,000 waka poems became available
in a machine-readable form, and it is expected that computers will be able to 
help researchers in finding poetic allusion. 

In this paper we attempt to semi-automatically detect instances of poetic al- 
lusion, or more generally, similar poems. One reasonable approach is to arrange 
all possible pairs of poems in decreasing order of similarity values, and to exam- 
ine by human efforts only the first 100 pairs, for example. A reliable similarity 
measure on waka poems plays a key role in such an approach.
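
The ranking approach just described can be sketched as follows. The similarity
function is pluggable; the longest-common-subsequence length used here is only a
stand-in chosen for the example and is not one of the measures proposed in this
paper. Poems are represented as lists of romanized syllables taken from Fig. 1,
and all function names are hypothetical.

```python
import heapq
from itertools import product


def lcs_length(a, b):
    """Length of the longest common subsequence of two syllable lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def top_pairs(anthology_a, anthology_b, similarity, n=100):
    """The n most similar pairs, in decreasing order of similarity."""
    return heapq.nlargest(n, product(anthology_a, anthology_b),
                          key=lambda pair: similarity(*pair))


kokin_147 = ("ho-to-to-ki-su-na-ka-na-ku-sa-to-no-a-ma-ta-a-re-ha-"
             "na-ho-u-to-ma-re-nu-o-mo-fu-mo-no-ka-ra").split("-")
shinkokin_216 = ("ho-to-to-ki-su-na-ho-u-to-ma-re-nu-ko-ko-ro-ka-na-"
                 "na-ka-na-ku-sa-to-no-yo-so-no-yu-fu-ku-re").split("-")
unrelated = ["zz"] * 31

ranked = top_pairs([unrelated, kokin_147], [shinkokin_216], lcs_length, n=1)
```

Using `heapq.nlargest` avoids sorting all of the roughly two million pairs when
only the first hundred or so are to be examined by hand.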

How to define similarity is one of the most difficult problems in AI. Since a
waka is a natural language text, it seems to require both syntactic and semantic
analyses. However, such analyses are very difficult, especially when the text is a
poem. In this paper we choose a different approach. That is, we take a poem 
simply as a string, i.e. a chain of characters, and define the similarity measure 
in such a way that no natural language processing technique is required. 

We first re-examine the already existing similarity (dissimilarity) measures 
for strings, and give a unifying framework which captures the essences of the 
measures. This framework describes a measure in terms of the set of common
patterns and the pattern scoring function. Under the framework, we design new
similarity measures which are appropriate for the problem of finding similar po- 
ems. Using these measures, we report successful results in finding similar poems 
between Kokinshu (1,111 poems) and Shinkokinshu (2,005 poems), which are 
known as the best two of the twenty-one imperial anthologies, and have been 
studied most extensively. The results are summarized as follows. 

— Of the most similar 15 of the over 2,000,000 combinations, all but two pairs 
were in fact instances of poetic allusion. 

— The 55th similar pair was an instance of poetic allusion that has never been 
pointed out in the long history of such research. 

Thus the proposed method was shown to be effective in finding poetic allu- 
sion. There have been very few studies on poetic allusion in the other anthologies, 
especially in private anthologies, until now. We hope that applying our method 
to such anthologies will enable us to build up a new body of theory about this 
rhetorical device. 

In a previous work Q, we studied the problem of finding characteristic pat- 
terns, consisting of auxiliary verbs and postpositional particles, from anthologies, 
and reported successful results. The goal of this paper is not to find patterns, 
although we do use patterns in the definition of similarity measures. It may 
be relevant to mention that this work is a multidisciplinary study involving 
researchers in both literature and computer science. In fact, the fifth and sixth
authors are, respectively, a WAKA researcher and a linguist of the Japanese language.

2 Similarity and Dissimilarity on Strings 

In this section we first give an overview of already existing similarity and dis- 
similarity measures, and then show a unifying framework which captures the 
essences of the measures and makes it easy to design new measures that will be 
appropriate for problems in various application domains. 



2.1 An Overview of Existing Measures 

One simple similarity measure is the length of the longest common subsequence 
(LCS). For example, the strings ACDEBA and ABDAC have two LCSs ADA and ABA, 
and therefore the similarity value is 3. The alignment of the strings in Fig. 2(a)
illustrates the fact that the string ADA is an LCS of them. The pairs of symbols
involved in the LCS are written one above the other with a vertical bar, and the
symbols not involved in the LCS are written opposite a blank symbol. Thus the
LCS length is the number of aligned pairs with a vertical bar.

On the other hand, one simple dissimilarity measure is the so-called Leven- 
stein distance between two strings. The Levenstein distance is often referred to 
as the edit distance, and is defined to be the minimum number of editing
operations needed for converting one string into the other. The editing operations

(a)  A C - D E B A -
     |     |     |
     A - B D - - A C

(b)  A C D E B A -
     |   |     |
     A B D - - A C

(c)  A C D E B - A -
     |       |   |
     A - - - B D A C

Fig. 2. Alignments (a blank symbol is shown as -)



here are insertion, deletion, and substitution of one symbol. The Levenstein dis- 
tance between the strings ACDEBA and ABDAC is 4, and the alignment shown in 
Fig. 2(b) illustrates the situation. Note that the second symbols C and B of the
two strings are aligned without a vertical bar. Such a pair corresponds to a
substitution, whereas the unaligned symbols opposite a blank symbol correspond
to an insertion or a deletion. One can observe that the alignment in Fig. 2(c)
gives the LCS ABA with the same length as ADA, but does not give the Levenstein 
distance since it requires five editing operations. 

The similarity and the dissimilarity are dual notions. For example, the LCS 
length measure is closely related to the edit distance in which only the insertion 
and the deletion operations are allowed. In fact, the edit distance of this kind is
equal to the total length of the two strings minus twice their LCS
length. This is not true if the substitution operation is also allowed.
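The two measures and the identity relating them can be checked on the running example. The following is a minimal Python sketch, not the authors' implementation; the function names `lcs_length` and `edit_distance` and the `allow_substitution` flag are illustrative assumptions.

```python
def lcs_length(x, y):
    """Length of a longest common subsequence of the sequences x and y."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            if a == b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def edit_distance(x, y, allow_substitution=True):
    """Minimum number of insertions, deletions and (optionally) substitutions."""
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, start=1):
        cur = [i]
        for j, b in enumerate(y, start=1):
            best = min(prev[j] + 1, cur[j - 1] + 1)  # delete a / insert b
            if a == b:
                best = min(best, prev[j - 1])        # aligned pair, no cost
            elif allow_substitution:
                best = min(best, prev[j - 1] + 1)    # substitute b for a
            cur.append(best)
        prev = cur
    return prev[-1]


x, y = "ACDEBA", "ABDAC"
print(lcs_length(x, y))                                # 3
print(edit_distance(x, y))                             # 4
print(edit_distance(x, y, allow_substitution=False))   # 5 = 6 + 5 - 2 * 3
```

With substitution disabled the distance is 5, which equals the total length 11 minus twice the LCS length 3; with substitution enabled the distance drops to 4, so the identity no longer holds.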

Sequence comparison for nucleotide or amino acid sequences in molecular
biology requires a slightly more complicated measure. A scoring function $\delta$
is defined so that $\delta(a, b)$ specifies the cost of substituting symbol $a$ for symbol
$b$, and $\delta(a, \varepsilon)$ and $\delta(\varepsilon, b)$ specify the costs of deleting symbol $a$ and inserting
symbol $b$, respectively. The distance between two strings is then defined to be
the minimum cost of converting one string into the other. This measure is
referred to as the generalized Levenstein distance.
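As an illustration, the generalized distance can be computed with the usual dynamic program, parameterized by the scoring function. This is a sketch under stated assumptions: `None` stands for the empty symbol, aligning a symbol with itself is assumed to cost nothing, and the two cost functions are hypothetical examples rather than biologically meaningful scores.

```python
def generalized_distance(x, y, delta):
    """Minimum total cost of converting x into y.

    delta(a, b) is the cost of substituting a by b, delta(a, None) the cost
    of deleting a, and delta(None, b) the cost of inserting b.
    """
    n, m = len(x), len(y)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + delta(x[i - 1], None)
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + delta(None, y[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a, b = x[i - 1], y[j - 1]
            d[i][j] = min(
                d[i - 1][j] + delta(a, None),                     # delete a
                d[i][j - 1] + delta(None, b),                     # insert b
                d[i - 1][j - 1] + (0 if a == b else delta(a, b)),  # align a, b
            )
    return d[n][m]


def unit_cost(a, b):
    """Every operation costs 1: the ordinary edit distance."""
    return 1


def costly_substitution(a, b):
    """Hypothetical costs: insertion and deletion cost 1, substitution 3."""
    return 1 if a is None or b is None else 3


print(generalized_distance("ACDEBA", "ABDAC", unit_cost))            # 4
print(generalized_distance("ACDEBA", "ABDAC", costly_substitution))  # 5
```

When a substitution costs more than a deletion followed by an insertion, it is never chosen, and the result coincides with the insertion/deletion-only distance.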

In molecular biology, other editing operations are often used. For example, 
a deletion and an insertion of consecutive symbols as a unit are allowed. The 
cost of such an operation is called a gap penalty, and it is given as a function of 
the length of a gap. Usually, an affine or a concave function is used as the gap 
function. Other editing operations such as swap, translocation, and reversal are 
also used (see 0, for example). 

2.2 A Unifying Scheme for Existing Measures 

For many existing measures, similarity (dissimilarity) of two strings can be
viewed as the maximum (minimum) ‘score’ of a ‘pattern’ that matches both of
them. In this view, the differences among the measures are the choices of

(1) the pattern set $\Pi$ to which common patterns belong, and

(2) the pattern scoring function which assigns a score to each pattern in $\Pi$.

A pattern is generally a description that defines a language over an alphabet, 
but from practical viewpoints of time complexity, we here restrict ourselves to a 
string consisting of symbols and a kind of wildcards such as *. For example, if
we use the regular pattern set and define the score of a pattern to be the number
of symbols in it, then we obtain the LCS length measure. In fact, the strings
ACDEBA and ABDAC have a common pattern A*D*A* which contains three symbols.
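The notion of a common pattern can be made concrete by translating a pattern into a regular expression. The sketch below assumes patterns are plain strings in which `*` is the only wildcard; the helper names are illustrative and not taken from the paper.

```python
import re


def pattern_to_regex(pattern):
    """Translate a pattern over symbols and '*' into a compiled regex."""
    parts = [".*" if c == "*" else re.escape(c) for c in pattern]
    return re.compile("".join(parts), re.DOTALL)


def is_common_pattern(pattern, x, y):
    """True if the pattern matches both x and y in their entirety."""
    rx = pattern_to_regex(pattern)
    return rx.fullmatch(x) is not None and rx.fullmatch(y) is not None


def symbol_count(pattern):
    """Score used for the LCS length measure: the number of non-wildcard symbols."""
    return sum(1 for c in pattern if c != "*")


print(is_common_pattern("A*D*A*", "ACDEBA", "ABDAC"))  # True
print(symbol_count("A*D*A*"))                          # 3
print(is_common_pattern("A*C*D*", "ACDEBA", "ABDAC"))  # False
```

The LCS length is then the largest symbol count over all common patterns; enumerating patterns this way is only illustrative, since the dynamic program for LCS finds the same value directly.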

More formally, let $\Sigma$ be the alphabet. A wildcard is an expression that
matches one or more strings in $\Sigma^*$. The set of strings that a wildcard $\gamma$ matches
is denoted by $L(\gamma)$. Clearly, $\emptyset \subset L(\gamma) \subseteq \Sigma^*$. Let $\Delta$ be a set of wildcards. A
string over $\Sigma \cup \Delta$ is called a pattern. Let $L(a) = \{a\}$ for all symbols $a$ in $\Sigma$. The
language $L(\pi)$ of a pattern $\pi = \gamma_1 \cdots \gamma_m$ is then defined to be the concatenation
of the languages $L(\gamma_1), \ldots, L(\gamma_m)$. A pattern $\pi$ is said to match a string $w$ in $\Sigma^*$
if $w \in L(\pi)$. A pattern $\pi$ is said to be a common pattern of two strings $x$ and $y$
in $\Sigma^*$ if it matches both of them, namely, $x, y \in L(\pi)$. We are now ready to
define similarity and dissimilarity measures on strings. A similarity (dissimilarity)
measure on strings over $\Sigma$ is a pair $(\Pi, \Phi)$ such that

— $\Pi = (\Sigma \cup \Delta)^*$ is a set of patterns, and

— $\Phi$ is a function from $\Pi$ to $\mathbf{R}$, which we call the pattern scoring function,

where $\Delta$ is a set of wildcards and $\mathbf{R}$ denotes the set of real numbers. The
similarity (dissimilarity) of two strings $x$ and $y$ is then defined to be the maximum
(minimum) value of $\Phi(\pi)$ among the common patterns $\pi \in \Pi$ of $x$ and $y$.
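In symbols, the definition can be restated as follows. The names $\mathrm{sim}$ and $\mathrm{dis}$ are introduced here only for illustration and do not appear in the original text.

```latex
\[
\mathrm{sim}(x, y) = \max \{\, \Phi(\pi) \mid \pi \in \Pi,\ x \in L(\pi),\ y \in L(\pi) \,\},
\qquad
\mathrm{dis}(x, y) = \min \{\, \Phi(\pi) \mid \pi \in \Pi,\ x \in L(\pi),\ y \in L(\pi) \,\}.
\]
```

For the LCS length measure on the running example, the common pattern A*D*A* attains the maximum, so $\mathrm{sim}(\mathrm{ACDEBA}, \mathrm{ABDAC}) = 3$.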

We introduce some notations to describe typical wildcards as follows.

$*$ : a wildcard that matches any string in $\Sigma^*$;

$\phi$ : a wildcard that matches any symbol in $\Sigma$;

$\phi(n)$ : a wildcard that matches any string in $\Sigma^*$ of length $n \ge 1$; and

$\phi(u_1 | \cdots | u_k)$ : a wildcard that matches any of the strings $u_1, \ldots, u_k$ in $\Sigma^*$.

In addition, we use a pair of brackets, [ and ], to imply ‘optionality’. For example,
$[a]$ is a wildcard that matches both the empty string $\varepsilon$ and the symbol $a \in \Sigma$.
Similarly, the wildcard $[\phi(n)]$ matches the empty string $\varepsilon$ and any string of length
$n$.

Using these notations, we show that most of the existing similarity and
dissimilarity measures can be described according to this scheme. Let

$\Delta_1 = \{*\}$,
$\Delta_2 = \{\phi\}$,
$\Delta_3 = \{\phi, [\phi]\}$,
$\Delta_4 = \{[a] \mid a \in \Sigma\} \cup \{\phi(a|b) \mid a, b \in \Sigma \text{ and } a \ne b\}$, and
$\Delta_5 = \Delta_4 \cup \{[\phi(n)] \mid n \ge 1\}$.

Let $\Pi_k = (\Sigma \cup \Delta_k)^*$ for $k = 1, \ldots, 5$.

Example 1. The pair $(\Pi_1, \Phi_1)$ such that $\Phi_1(\pi)$ is the number of occurrences of
symbols within pattern $\pi \in \Pi_1$ gives the similarity measure of the LCS length.

Example 2. The pair $(\Pi_2, \Phi_2)$ such that $\Phi_2(\pi)$ is the number of occurrences of
$\phi$ within pattern $\pi \in \Pi_2$ gives the Hamming distance.

Example 3. The pair $(\Pi_3, \Phi_3)$ such that $\Phi_3(\pi)$ is the number of occurrences of
$\phi$ and $[\phi]$ within pattern $\pi \in \Pi_3$ gives the Levenstein distance.

Example 4. The pair $(\Pi_4, \Phi_4)$ such that $\Phi_4 : \Pi_4 \to \mathbf{R}$ is a homomorphism
defined by $\Phi_4([a]) = \delta(a, \varepsilon)$, $\Phi_4(\phi(a|b)) = \delta(a, b)$, and $\Phi_4(a) = 0$ for $a, b \in \Sigma$,
gives the generalized Levenstein distance mentioned in Section 2.1.

Example 5. The pair $(\Pi_5, \Phi_5)$ such that $\Phi_5 : \Pi_5 \to \mathbf{R}$ is a homomorphism
defined in the same way as $\Phi_4$ in Example 4 except that $\Phi_5([\phi(n)])$ is the gap
penalty for a gap of length $n$, gives the generalized Levenstein distance with gap
penalties mentioned in Section 2.1.

Although all the functions $\Phi_i$ in the above examples are homomorphisms from
$\Pi_i$ to $\mathbf{R}$, a pattern scoring function $\Phi$ does not have to be a homomorphism
in general. In the next section, we give a new similarity measure under the 
framework which is suited for application to the problem of discovering the 
poetic allusion from anthologies of the classical Japanese poems. 

3 Similarity Measures on WAKA Poems

We have two goals in finding similar poems in anthologies of WAKA poems. One 
goal is to identify poems that were originally identical, and where only a small 
portion has been changed accidentally or intentionally while being copied by 
hand. Pursuing reasons for such differences will provide clues on how the poems 
have been received and handed down historically. 

The other goal is to find instances of poetic allusion. In this case, by analyzing 
how poems alluded to earlier ones, it will be possible to generate and test new 
theories about the rhetorical device. 

In this section we consider how to define similarity between WAKA poems.



3.1 Changes of Line Order 

In poetic allusion, a large proportion of the expressions from an earlier poem 
were used in a new one. A poet must therefore take care to prevent his poem 
from merely being an imitation. Fujiwara no Teika gave the following rules 
in his writings for beginners, “Kindai shuka” (Modern excellent poems; 1209)
and “Eiga no taigai” (A basis of composing poems; ca 1221). 

— The use of expressions from the poem alluded to should be limited to at 
most two lines and an additional three or four characters. 

— The expressions from the poem alluded to should be located differently. 

— The topic should be changed.

The second item forces us to consider all the possible correspondences between
the lines of two poems. Since a poem consists of five lines, there are
5! = 120 different correspondences. We shall compute a best correspondence 
which maximizes the sum of similarities between paired lines, and define the 
similarity between the two poems to be the maximum sum. 

Consider the two poems of Fig. 1. If we use the LCS lengths as the similarity
values between the paired lines, the permutation (1, 4, 5, 2, 3) yields the
correspondence that gives the maximum value 21 (see Table 1).



Table 1. Best correspondence

Kokinshu #147               Shinkokinshu #216           sim.
1: HO-TO-TO-KI-SU           1: HO-TO-TO-KI-SU           5
2: NA-KA-NA-KU-SA-TO-NO     4: NA-KA-NA-KU-SA-TO-NO     7
3: A-MA-TA-A-RE-HA          5: YO-SO-NO-YU-FU-KU-RE     1
4: NA-HO-U-TO-MA-RE-NU      2: NA-HO-U-TO-MA-RE-NU      7
5: O-MO-FU-MO-NO-KA-RA      3: KO-KO-RO-KA-NA           1
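The search over the 5! = 120 line correspondences can be sketched as a brute-force loop over permutations. This is an illustrative sketch, not the authors' code: lines are split into syllables at the hyphens, the per-line similarity is the LCS length, and the names `best_correspondence`, `KOKINSHU_147` and `SHINKOKINSHU_216` are assumptions made for the example.

```python
from itertools import permutations


def lcs_length(x, y):
    """Length of a longest common subsequence of the sequences x and y."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def best_correspondence(poem_a, poem_b, line_sim=lcs_length):
    """Return (score, perm): line i of poem_a is paired with line perm[i-1] of poem_b."""
    lines_a = [line.split("-") for line in poem_a]
    lines_b = [line.split("-") for line in poem_b]
    best_score, best_perm = None, None
    for perm in permutations(range(len(lines_b))):
        score = sum(line_sim(lines_a[i], lines_b[j]) for i, j in enumerate(perm))
        if best_score is None or score > best_score:
            best_score, best_perm = score, perm
    return best_score, tuple(j + 1 for j in best_perm)  # 1-based line numbers


KOKINSHU_147 = [
    "HO-TO-TO-KI-SU", "NA-KA-NA-KU-SA-TO-NO", "A-MA-TA-A-RE-HA",
    "NA-HO-U-TO-MA-RE-NU", "O-MO-FU-MO-NO-KA-RA",
]
SHINKOKINSHU_216 = [
    "HO-TO-TO-KI-SU", "NA-HO-U-TO-MA-RE-NU", "KO-KO-RO-KA-NA",
    "NA-KA-NA-KU-SA-TO-NO", "YO-SO-NO-YU-FU-KU-RE",
]

print(best_correspondence(KOKINSHU_147, SHINKOKINSHU_216))
# (21, (1, 4, 5, 2, 3))
```

For five lines the exhaustive search is cheap; for longer texts the same maximum could be found with an assignment algorithm, which is a design alternative not discussed in the text.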



3.2 Evaluation of Similarity Measure 

To estimate the ‘goodness’ of a similarity measure, we need a sufficient number 
of examples of poetic allusion. The data we used here is from Shugyokushu, 
a private anthology by the priest Jien (1155-1225). The 100 poems numbered 
from 3,472 to 3,571 of this anthology were composed as allusive variations of 
Kokinshu poems, and the poems alluded to were identified by annotations. We 
used these 100 pairs of allusive variations as positive examples, and the other 
9,900 combinations between the two sets of 100 poems as negative examples. 

Using these examples, we estimated the performance of the measure based on 
the LCS length between the paired lines. It was found that 96% of the positive 
examples had similarity values greater than or equal to 10, and 96% of the 
negative examples had similarity values less than or equal to 10. This implies 
that an appropriate threshold value can classify the positive and the negative 
examples with high precision.

Let us denote by $\mathrm{Succ}_P(t)$ the number of positive examples with a similarity
greater than or equal to $t$ divided by the number of all positive examples, and by
$\mathrm{Succ}_N(t)$ the number of negative examples with a similarity less than $t$ divided
by the number of all negative examples. The best threshold $t$ is then defined
to be the one maximizing the geometric mean $\sqrt{\mathrm{Succ}_P(t) \times \mathrm{Succ}_N(t)}$. In the
above case we obtained a threshold $t = 11$ which gives the maximum value
$\sqrt{0.9200 \times 0.9568} = 0.9382$.

3.3 New Similarity Measure

Now we discuss how to improve the similarity measure. Consider the following poems.

Poem alluded to. (Kokinshu #315) 

YA-MA-SA-TO-HA / FU-YU-SO-SA-HI-SHI-SA / MA-SA-RI-KE-RU 
HI-TO-ME-MO-KU-SA-MO / KA-RE-NU-TO-O-MO-HE-HA 
Allusive-variation. (Shugyokushu #3528) 

YA-TO-SA-HI-TE / HI-TO-ME-MO-KU-SA-MO / KA-RE-NU-RE-HA 
SO-TE-NI-SO-NO-KO-RU / A-KI-NO-SHI-RA-TSU-YU 

The best correspondence in this case is given by the permutation (1, 5, 4, 2, 3).
One can observe that the pairs (YA-MA-SA-TO-HA, YA-TO-SA-HI-TE),
(FU-YU-SO-SA-HI-SHI-SA, A-KI-NO-SHI-RA-TSU-YU), and (MA-SA-RI-KE-RU,
SO-TE-NI-SO-NO-KO-RU) have scores of 2, 1, and 1, respectively, although these
pairs seem completely dissimilar. That is, these scores should be decreased. On
the other hand, the pair (KA-RE-NU-TO-O-MO-HE-HA, KA-RE-NU-RE-HA) has a score
of 4, and it is relatively similar.

These observations tell us that the continuity of the symbols in a common
pattern is an important factor. Compare the common pattern YA*TO* of
YA-MA-SA-TO-HA and YA-TO-SA-HI-TE, and the common pattern KARENU*HA of
KA-RE-NU-TO-O-MO-HE-HA and KA-RE-NU-RE-HA. Thus we will define a pattern
scoring function so that $\Phi(*a*b*) < \Phi(*ab*)$.

Let us focus on the length of clusters of symbols in patterns. For example,
the clusters in a pattern $*a*bc*d*$ are $a$, $bc$, and $d$ from the left, and their lengths
are 1, 2, and 1, respectively. Suppose we are given a mapping $f$ from the set
of positive integers into the set of real numbers, and let the score $\Phi(\pi)$ of the
pattern $\pi = *a*bc*d*$ be $f(1) + f(2) + f(1)$. For our purpose, the mapping $f$
must satisfy the conditions $f(n) > 0$ and $f(n + m) > f(n) + f(m)$, for any
positive integers $n$ and $m$. There are infinitely many mappings satisfying the
conditions. Here, we restrict $f$ to the form $f(n) = n - s$ $(0 < s < 1)$.
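One way to compute the maximum score over common patterns under a cluster-length scoring is a dynamic program that tries every common run ending at each pair of positions. The sketch below is an assumption about how this could be implemented, not the authors' algorithm; it relies on the wildcard * being allowed to match the empty string, and the names `cluster_similarity` and `pattern_score` are illustrative.

```python
def pattern_score(pattern, f):
    """Score of a pattern over symbols and '*': sum of f over its cluster lengths."""
    return sum(f(len(cluster)) for cluster in pattern.split("*") if cluster)


def cluster_similarity(x, y, f):
    """Maximum, over common patterns of x and y, of the summed cluster scores."""
    n, m = len(x), len(y)
    # run[i][j]: length of the longest common suffix of x[:i] and y[:j]
    run = [[0] * (m + 1) for _ in range(n + 1)]
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                run[i][j] = run[i - 1][j - 1] + 1
            score = max(best[i - 1][j], best[i][j - 1])
            # close a cluster of length k that ends at x[i-1] and y[j-1]
            for k in range(1, run[i][j] + 1):
                score = max(score, best[i - k][j - k] + f(k))
            best[i][j] = score
    return best[n][m]


def f(n, s=0.8):
    """The restricted form f(n) = n - s; s = 0.8 is one value in the reported range."""
    return n - s


print(pattern_score("*a*bc*d*", f))  # f(1) + f(2) + f(1), about 1.6
a = cluster_similarity("YA-MA-SA-TO-HA".split("-"), "YA-TO-SA-HI-TE".split("-"), f)
b = cluster_similarity("KA-RE-NU-TO-O-MO-HE-HA".split("-"), "KA-RE-NU-RE-HA".split("-"), f)
print(round(a, 2), round(b, 2))      # 0.4 2.4
```

Under the LCS length the two pairs score 2 and 4; with s = 0.8 the dissimilar pair drops to about 0.4 while the pair sharing the run KA-RE-NU keeps about 2.4, which is the effect the text asks for.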

For a parameter $s$ varied from 0 through 1, we computed the threshold $t$
that maximizes the previously mentioned geometric mean. The maximum value
of the geometric mean was obtained for a parameter $s$ = 0.8 ~ 0.9 and for a
threshold $t = 8.9$, and the value was $\sqrt{0.9600 \times 0.9680} = 0.9604$.
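The threshold selection by geometric mean, as defined in Section 3.2, can be sketched as follows. The similarity values in the example are made-up toy data, not the Shugyokushu figures, and the function name `best_threshold` is an assumption.

```python
from math import sqrt


def best_threshold(pos, neg):
    """Return (t, g): the threshold t maximizing g = sqrt(Succ_P(t) * Succ_N(t)).

    Succ_P(t) is the fraction of positive similarities >= t and
    Succ_N(t) the fraction of negative similarities < t.
    Only observed similarity values need to be tried as thresholds.
    """
    best_t, best_g = None, -1.0
    for t in sorted(set(pos) | set(neg)):
        succ_p = sum(1 for v in pos if v >= t) / len(pos)
        succ_n = sum(1 for v in neg if v < t) / len(neg)
        g = sqrt(succ_p * succ_n)
        if g > best_g:
            best_t, best_g = t, g
    return best_t, best_g


# Hypothetical similarity values, for illustration only.
positives = [12, 15, 9, 14]
negatives = [8, 10, 7, 11, 6]
t, g = best_threshold(positives, negatives)
print(t, round(g, 4))                      # 12 0.866

# The value reported for the LCS-based measure in Section 3.2:
print(round(sqrt(0.9200 * 0.9568), 4))     # 0.9382
```

Sweeping the parameter s then amounts to recomputing the similarity values for each s and calling the same selection routine.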

4 Experimental Results 

In our experiment, we used two anthologies Kokinshu (compiled in 922; 1,111 
poems) and Shinkokinshu (compiled in 1205; 2,005 poems), which are known 
as the best two of the twenty-one imperial anthologies, and have been stud- 
ied most extensively. We computed the similarity values for each of the over 
2,000,000 combinations in order to find the Shinkokinshu poems that allude 
to Kokinshu poems (not forgetting that many Shinkokinshu poems allude to 
poems in anthologies other than Kokinshu). The similarity measures we used 
are as follows:

Table 2. Frequency distributions of similarity values

(a) Measure A                         (b) Measure B

sim.     freq.   cumulat. freq.       sim.       freq.   cumulat. freq.
 23          1            1           16-17          2            2
 22          0            1           15-16          1            3
 21          3            4           14-15          4            7
 20          4            8           13-14          8           15
 19          5           13           12-13         26           41
 18         26           39           11-12         32           73
 17         52           91           10-11         77          150
 16        114          205            9-10        137          287
 15        268          473            8-9         332          619
 14        916         1389            7-8        1066         1685
 13       3311         4700            6-7        3160         4845
 12      13047        17747            5-6       10089        14934
 11      50284        68031            4-5       35407        50341
 10     162910       230941            3-4      134145       184486
  9     394504       625445            2-3      433573       618059
  8     632954      1258399            1-2      873904      1491963
  7     588882      1847281            0-1      717547      2209510
  6     288190      2135471
  5      66873      2202344
  4       6843      2209187
  3        318      2209505
  2          5      2209510
  1          0      2209510
  0          0      2209510

Measure A. The maximum sum of similarities computed line-by-line using the 
LCS length measure. 

Measure B. The maximum sum of similarities computed line-by-line using the 
measure presented in Section 3.3.

Table 2 shows the frequency distributions of similarity values.

We first verified that changing the measure from A to B improves the results 
in the sense that most of the pairs which are not so similar as poems but had 
relatively high similarity, now have relatively low similarity. 

Next we examined the first 73 pairs in decreasing order of Measure B that 
have a similarity value greater than or equal to 11. It was found that 43 of 
the 73 pairs were indicated as poetic allusion in the standard editions of 
Shinkokinshu with annotations. The other 30 pairs were generally not consid- 
ered to be poetic allusions, although some of them seem to be possible instances. 
Note that such judgements are to some extent subjective. 

All but three of the first 15 pairs were identified as poetic allusion in these
editions.
One of the three exceptions seems actually to be poetic allusion, while this does 
not seem to be the case for the remaining two. The two had long expressions in 
common, HA-RU-KA-SU-MI TA-NA-HI-KU-YA-MA-NO and *NO-MI-TO-RI-SO I-RO- 
MA-SA-RI-KE-RU, respectively. However, both of these expressions are frequent in 
WAKA poems, so they cannot be considered specific allusions. By considering the
frequencies of the expressions, the similarity values of such pairs can be decreased.

It should be emphasized that the following pair, ranked 55th in Measure B,
was apparently an instance of poetic allusion of which we can find no indication
in [2, 3].

Poem alluded to. (Kokinshu #826) 

A-FU-KO-TO-WO / NA-KA-RA-NO-HA-SHI-NO / NA-KA-RA-HE-TE 
KO-HI-WA-TA-RU-MA-NI / TO-SHI-SO-HE-NI-KE-RU 
Allusive-variation. (Shinkokinshu #1636) 

NA-KA-RA-HE-TE / NA-HO-KI-MI-KA-YO-WO / MA-TSU-YA-MA-NO 
MA-TSU-TO-SE-SHI-MA-NI / TO-SHI-SO-HE-NI-KE-RU 

It is considered that this pair has been overlooked in the long research history 
of WAKA poetry. Note that the rank of this pair in Measure A was 92 ~ 205. 

The experimental results imply that the proposed measure is effective in 
finding similar poems. We cannot give a precise evaluation of the results since 
we have no complete list of instances of poetic allusion between the anthologies. 

5 Discussion and Future Work 

One can observe in the following two poems that the strings yoshino and yama, 
originally in the second line, appear separately in the first and the second lines 
of the new poem. 

Poem alluded to. (Kokinshu #321) 

FU-RU-SA-TO-HA / YO-SHI-NO-NO-YA-MA-SHI / CHI-KA-KE-RE-HA
HI-TO-HI-MO-MI-YU-KI / FU-RA-NU-HI-HA-NA-SHI

Allusive-variation. (Shinkokinshu #1)

MI-YO-SHI-NO-HA / YA-MA-MO-KA-SU-MI-TE / SHI-RA-YU-KI-NO 
FU-RI-NI-SHI-SA-TO-NI / HA-RU-HA-KI-NI-KE-RI 

Our similarity measure is not suited for such situations since we considered 
only line-to-line correspondences. The following improvement will be appropriate.
For strings $u_1, \ldots, u_n$ ($n \ge 1$), let $\pi(u_1, \ldots, u_n)$ be an extended pattern
that matches any string in the language $L(*u_{\sigma(1)}* \cdots *u_{\sigma(n)}*)$ for every
permutation $\sigma$ of $\{1, \ldots, n\}$, and define the score to be the sum
$\sum_{i=1}^{n} f(|u_i|)$ for some function $f$.

As a preliminary result, we have shown that the function $f$ defined by $f(1) = 0$
and $f(n) = n$ for $n > 1$ gives the geometric mean $\sqrt{0.9900 \times 0.9507} = 0.9702$,
which is better than those for Measures A and B.

As mentioned in the previous section, a highly frequent expression could not 
allude to a particular poem. Hence, a new idea is to assign a smaller score to 
a pattern if it is not rare, i.e., it appears frequently in other strings (poems). 
The rarity of common patterns can be formalized in terms of machine learning 
as follows. Suppose we are given a finite subset $S$ of $\Sigma^+$, and we have only
to consider the similarity on the strings in $S$. Let $x, y \in S$ with $x \ne y$. Let
us regard $Pos = \{x, y\}$ as positive examples and $Neg = S - Pos$ as negative
examples. The rarest common pattern $\pi$ of $x$ and $y$ with respect to $S$ is the one
satisfying $Pos \subseteq L(\pi)$ and minimizing the one-sided error $|Neg \cap L(\pi)|$, equivalently
minimizing $|S \cap L(\pi)|$, the frequency of $\pi$ in $S$.

Significance of common patterns depends upon both their scores and their 
rarity. Our future work will investigate ways of unifying the two criteria.



Abstract. We consider the Maximum Agreement Problem which is, given
positive and negative documents, to find a characteristic set that matches
many of the positive documents but rejects many of the negative ones. A
characteristic set is a sequence $(x_1, \ldots, x_d)$ of strings such that each $x_i$ is
a suffix of $x_{i+1}$ and all $x_i$'s appear in a document without overlaps. A
characteristic set matches semi-structured documents with primitives or
user-defined macros. For example, (“set”, “characteristic set”, “</title>
characteristic set”) is a characteristic set extracted from an HTML file.
But an algorithm that solves the Maximum Agreement Problem does not
output useless characteristic sets, such as those made of only HTML tags,
since such characteristic sets may match most of the positive documents
but also match most of the negative ones. We present an algorithm
that, given an integer $d$ which is the number of strings in a characteristic
set, solves the Maximum Agreement Problem in $O(n^2 h^d)$ time, where $n$ is
the total length of documents and $h$ is the height of the suffix tree of the
documents.



1 Introduction 

The progress of computer technologies makes it easy to create and store large
amounts of document files such as HTML/XML files, e-mails, Net News articles,
etc. These collections of documents can be thought of as large-scale databases.
It is becoming important to discover unknown and interesting knowledge from
such databases.

Many researchers make efforts to develop tools and algorithms which discover
interesting rules or patterns from well-structured databases. However,
such tools and algorithms for well-structured databases are not suitable for
semi-structured databases. A semi-structured database is a database of
semi-structured files, which have tree-like structures, such as HTML/XML files,
BibTeX files, etc. A node of the structure is represented by a directive, such as
a tag in HTML. The node may have any number of children. Textual data
mining methods for such semi-structured databases, or for those without
structure, have also been developed.

In this paper, we consider the Maximum Agreement Problem, which is, given
positive and negative documents, to find some form of rules or patterns which
satisfy some measure. Nakanishi et al. consider that a rule consists of a



S. Arikawa, K. Furukawa (Eds.): DS’99, LNAI 1721, pp. 139-^33 1999. 
(c) Springer- Verlag Berlin Heidelberg 1999 






Daisuke Ikeda 



single string and the measure is that the string matches all positive documents
but does not match any negative ones. They show that the problem is solvable in
linear time. Arimura et al. [7] consider that a rule is a proximity word association
pattern $(w_1, w_2; k)$, where $w_1, w_2$ are strings and $k$ is an integer, and define that
the pattern matches a document if the document has $w_1$ and $w_2$ as substrings
in this order and the distance between $w_1$ and $w_2$ is less than $k$. As a measure,
they adopt the maximum agreement which is, among all patterns, the maximum
number of positive documents matched by a pattern minus that of negative
ones. They show that the problem is solvable in $O(mn^2)$ time using suffix trees
as the data structure, where $m$ is the number of all documents and $n$ is
the total length of them. Then, Arimura and Shimozono [6] extend the above
pattern to those in which $d > 1$ strings are contained for fixed $d$ and show that
the problem is solvable in $O(k^{d-1} h^d n \log n)$ time, where $h$ is the height of the
suffix tree of the documents.

We also adopt the maximum agreement as the measure because it is known
that an algorithm that solves the Maximum Agreement Problem efficiently is robust
to noisy inputs, and works well even if the algorithm does not have any
knowledge about the structure behind the target rules or patterns. This measure
reduces the number of useless patterns or rules, such as those made of only
HTML tags, since such patterns may match most of the positive documents but also
match most of the negative ones. This reduction does not require knowledge
about the format or grammar of the input documents.

As a pattern, we consider a finite sequence of strings $(x_1, \ldots, x_d)$, called a
characteristic set, such that $x_i$ is a suffix of $x_{i+1}$ for each $i$ ($1 \le i \le d - 1$).
We define that the sequence matches a document if each $x_i$ appears in the
document and each pair of $x_i$ and $x_j$ ($i \ne j$) has no overlap on the document. This
definition has the following good properties: (1) There is no restriction on the
distance between strings. Due to this, characteristic sets whose strings are widely
spread in a document are found. (2) There is no restriction on the order of
appearances of strings. (3) The restriction that a string be a suffix of its successor
reduces the complexity of a mining algorithm. In spite of the restriction, a
characteristic set seems to be powerful enough to express a set of important or
interesting strings in semi-structured data. For example, the sequence (“string”,
“>important string”, “</title>important string”) is a characteristic set which
includes a tag or a part of a tag. Strings with any tags, including user-defined
ones, of semi-structured data can be expressed by a characteristic set.

The straightforward algorithm that enumerates all possible substrings then 
constructs a characteristic set from each substring requires time. In 

this paper, we present a simple and efficient algorithm that, given an integer $d$,
solves the Maximum Agreement Problem in $O(n^2 h^d)$ time, where $n$ is the total
length of documents and $h$ is the height of the suffix tree of input documents.
This algorithm also uses suffix trees, like that of [6]. However, our algorithm is
simpler and more efficient, because their algorithm requires $O(k^{d-1} h^d n \log n)$
time, where $k$ is the distance between two consecutive strings in a pattern. If
input documents are uniformly random, it is known that the height $h$ is small
compared to $n$. In this case, the algorithm runs much faster.

This paper is organized as follows. Basic notations and definitions are given in
Section 2. The main algorithm is described in Section 3, and its correctness and
complexity are shown in Section 4. Discussion and future work are described
in Section 5.

2 Preliminaries 

The cardinality of a finite set $T$ is denoted by $|T|$. The set $\Sigma$ is a finite alphabet.
Let $x = a_1 \cdots a_n$ be a string in $\Sigma^*$. We denote the length of $x$ by $|x|$. For an
integer $1 \le i \le |x|$, we denote by $x[i]$ the $i$th letter of $x$. The concatenation of
two strings $x$ and $y$ is denoted by $x \cdot y$ or simply by $xy$. For a string $x$, if there
exist strings $u, v, w \in \Sigma^*$ such that $x = uvw$, we say that $u$ (resp. $v$ and $w$) is
a prefix (resp. a substring and a suffix) of $x$.

Let $y$ be a substring of a string $x \in \Sigma^*$. An occurrence of $y$ in $x$ is a positive
integer $i$ ($1 \le i \le |x|$) such that $x[i] \cdots x[i + |y| - 1] = y$. Let $y$ and $z$ be substrings
of $x \in \Sigma^*$. Then, we say that $y$ and $z$ have an overlap on $x$ if there exist the
same occurrences of $y$ and $z$ in $x$.

A characteristic set $\pi$ of strings over $\Sigma$ is a finite sequence $\pi = (x_1, \ldots, x_d)$
of strings in $\Sigma^*$ such that $x_i$ is a prefix of $x_{i+1}$ for each $1 \le i \le d - 1$. The
characteristic set $\pi$ matches a string $x \in \Sigma^*$¹ if (1) there exists an occurrence
of $x_j$ in $x$ for each $1 \le j \le d$ and (2) each pair of $x_i$ and $x_j$ has no overlap on $x$.
Note that strings in a characteristic set are allowed to occur in $x$ in any order.
Thus, the characteristic set may possibly match $n!$ strings.

A sample is a finite set $S = \{s_1, \ldots, s_m\}$ of strings in $\Sigma^*$. An objective
condition over $S$ is a binary labeling function $\xi : S \to \{0, 1\}$. Each string $s_i$ in $S$
is called a document. For a document $s$, we call it a positive document if $\xi(s) = 1$
and a negative document if $\xi(s) = 0$.

Let $S$ be a sample and $\xi$ an objective condition over $S$. Then, for a string
$s \in S$, we say that a characteristic set $\pi$ agrees with $\xi$ on $s$ if $\pi$ matches $s$ if and
only if $\xi(s) = 1$. For $\pi$, the agreement on $S$ and $\xi$ is the number of documents
in $S$ on which $\pi$ agrees with $\xi$.
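The definitions of matching and agreement can be illustrated with a small brute-force sketch. It rests on an assumed reading of "no overlap": one occurrence is chosen for each string so that the occupied intervals of the document are pairwise disjoint. That reading, the function names, and the toy sample below are all assumptions made for illustration, not the paper's algorithm.

```python
from itertools import product


def occurrences(text, pattern):
    """1-based start positions of a non-empty pattern in text."""
    return [i + 1 for i in range(len(text) - len(pattern) + 1)
            if text.startswith(pattern, i)]


def matches(char_set, doc):
    """Assumed reading: every string occurs, with pairwise disjoint chosen occurrences."""
    spans = [[(p, p + len(s) - 1) for p in occurrences(doc, s)] for s in char_set]
    for choice in product(*spans):
        ordered = sorted(choice)
        # intervals sorted by start are pairwise disjoint iff neighbours are disjoint
        if all(a[1] < b[0] for a, b in zip(ordered, ordered[1:])):
            return True
    return False


def agreement(char_set, sample, label):
    """Number of documents on which matching coincides with the label being 1."""
    return sum(1 for doc in sample if matches(char_set, doc) == (label[doc] == 1))


# Hypothetical sample: document -> label (1 = positive, 0 = negative).
label = {"abcxab": 1, "abc": 0, "abxabc": 1, "xyz": 0, "ababc": 0}
pi = ("ab", "abc")  # "ab" is a prefix of "abc"
print(matches(pi, "abcxab"), matches(pi, "abc"))  # True False
print(agreement(pi, list(label), label))          # 4
```

The exhaustive choice of occurrences is exponential in the number of strings and is meant only to make the definitions concrete.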

Definition 1 (Arimura and Shimozono [6]). The Maximum Agreement Problem
is, given a tuple $(\Sigma, S, \xi, d)$, to find a characteristic set containing $d$ strings
over $\Sigma$ that maximizes the agreement on $S$ and $\xi$ over all characteristic
sets, where $\Sigma$ is an alphabet, $S \subseteq \Sigma^*$ is a set of documents,
$\xi : S \to \{0, 1\}$ is an objective condition, and $d$ is a positive integer.

Let \$ be the special letter such that $a \ne \$$ for any $a \in \Sigma$ and $\$ \ne \$$. A text
is a string $A$ that ends with \$. For a text $A$ and an integer $p$ ($1 \le p \le |A|$),
$A_p$ denotes $A$'s suffix starting at the $p$th letter of $A$. Let $A_{p_1}, A_{p_2}, \ldots, A_{p_n}$ be
all suffixes of $A$ in lexicographic order². The suffix tree for $A$ is the compact
trie for $A_{p_1}, A_{p_2}, \ldots, A_{p_n}$. For each node $v$ of the tree, $BS(v)$ denotes the string
obtained by concatenating all strings labeled on edges of the path from the root
to $v$. A subword $x$ on an edge is encoded by the occurrence of $x$, so that $BS(v)$
for any node $v$ is computable in constant time. We assume that, for each node $v$
of the tree, its children are sorted in lexicographical order of strings labeled on
edges between $v$ and them.

¹ There is no essential difference between that $x_i$ is a suffix of its successor and that $x_i$
is a prefix of its successor. Thus, we use the prefix restriction for a technical
reason.

Example 1. Let $s = abcbcabc\$$ be a text. The suffix tree of the text $s$ is described
in Fig. 1. For nodes $u$ and $v$ in Fig. 1 we have $BS(u) = abc$ and $BS(v) = bcabc\$$.




We call BS(v) a branching string, and we say that a characteristic set containing
only branching strings is in canonical form. Two strings x, y ∈ Σ* are
equivalent over a string A ∈ Σ*, denoted by x ≡_A y, if and only if they have the
same set of occurrences in A. Characteristic sets π = (x_1, ..., x_d) and
τ = (y_1, ..., y_d) are equivalent over A if and only if x_j ≡_A y_j
for all 1 ≤ j ≤ d.

Lemma 1 is the key lemma for reducing the time complexity; it is originally due
to Arimura and Shimozono.

Lemma 1 (Arimura and Shimozono). Let A be a text and x be a substring
of A. Then there exists a node v in the suffix tree of A such that x and
BS(v) are equivalent over A.

Proof. If there exists a node u such that x = BS(u), the proof is complete. So we
assume that there is no such node. In this case there exists a unique node v
such that x is a prefix of BS(v) and BS(w) is a prefix of x, where w is the parent
of v. By the definition of the suffix tree, there is no node between v and w.
Therefore, x, whose end lies on the edge between w and v, and BS(v) have
the same leaves as descendants.
Thus, the sets of occurrences of x and BS(v) are the same, since the labels on
the leaves give the set of occurrences.

(Footnote on the lexicographic order of suffixes: we define x to be the
predecessor of x$ for any x ∈ Σ* in the order.)

From this lemma, for any characteristic set π = (x_1, ..., x_d), there exists
an equivalent characteristic set π' in canonical form. By the definition of
equivalence, the numbers of documents matched by π and π' are the same.
Thus, it is enough to consider only characteristic sets in canonical form.
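Lemma 1 can be checked on small texts with the following Python sketch. The helper names `occurrences` and `canonical` are hypothetical, and the code assumes that the text ends with a unique $. Starting from a substring x, `canonical` extends x while all of its occurrences are followed by the same letter; the result is a branching string with the same set of start positions.

```python
def occurrences(x: str, text: str) -> frozenset:
    """Start positions of all occurrences of x in text."""
    return frozenset(i for i in range(len(text) - len(x) + 1)
                     if text.startswith(x, i))

def canonical(x: str, text: str) -> str:
    """Extend x until it is followed by two distinct letters or reaches the
    end of the text; the result is BS(v) for the node v of Lemma 1."""
    occ = occurrences(x, text)
    assert occ, "x must be a substring of text"
    while True:
        nxt = {text[i + len(x)] for i in occ if i + len(x) < len(text)}
        if len(nxt) != 1:          # branching point or end of the text
            return x
        x += next(iter(nxt))       # a unique continuation keeps the occurrences

text = "abcbcabc$"
for i in range(len(text)):
    for j in range(i + 1, len(text) + 1):
        x = text[i:j]
        # x and its canonical form are equivalent over the text
        assert occurrences(x, text) == occurrences(canonical(x, text), text)

print(canonical("b", text), canonical("a", text), canonical("cb", text))
```

For the text of Example 1 this maps b to bc, a to abc, and cb to the leaf string cbcabc$.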



3 Mining Algorithm 

In this section, we present the main algorithm, which solves the Maximum
Agreement Problem efficiently.

Let (Σ, S, ξ, d) be an instance, where Σ is an alphabet, S ⊆ Σ* is a sample
set, ξ : S → {0,1} is an objective condition, and d > 0 is an integer. Let m be
the number of documents in S, that is, m = |S|, and n be the total length of all
documents in S.

First, we consider a straightforward algorithm that enumerates all possible
substrings, constructs a characteristic set from each substring, checks the
agreement for each of them, and outputs the best characteristic set. For a fixed
string w, there exist O(|w|^2) substrings, which bounds the number of possible
characteristic sets. For each characteristic set, the algorithm counts matched
documents by a typical pattern matching algorithm and computes the agreement
from the counting result. Checking whether a characteristic set matches a
document s takes O(|s| + occ) time using the Aho-Corasick algorithm, where occ
is the number of all occurrences of all strings in the characteristic set. Thus,
this straightforward algorithm is expensive: it runs the pattern matching once
for every candidate characteristic set.

Next, we develop a more efficient algorithm, Efficient_Miner, which is described
in Fig. 2. First, the algorithm creates the text A by concatenating all documents
in S, with $ marking each boundary. Then, it creates the suffix tree Tree_A of A
in O(n) time. Finally, it calls the procedure Discover, which finds a
characteristic set maximizing the agreement among all possible characteristic sets.



Procedure Efficient_Miner
{Input: an integer d ≥ 1, a sample S ⊆ Σ*, and an objective condition
 ξ : S → {0, 1}}
begin
    A := s_1$ ··· s_m$;
    create the suffix tree Tree_A of A;
    Discover(d, Tree_A);
    output a characteristic set that maximizes the agreement;
end;



Fig. 2. The algorithm that finds the best characteristic set.

The procedure Discover is described in Fig. 3. The first argument d specifies
how many strings it must find. The procedure Discover enumerates all
paths of a given suffix tree Tree_A. We assume that a path is denoted by a
sequence (v_1, ..., v_h) of nodes such that v_i is the parent of v_{i+1}. For each
path (v_1, ..., v_h), it enumerates all sequences (v_{i_1}, ..., v_{i_d}) of d
nodes such that v_{i_k} is a descendant of v_{i_j} (j < k). For each such sequence,
the procedure Discover constructs the characteristic set π as follows.

π = (BS(v_{i_1}), ..., BS(v_{i_d})).

Then, it counts the number of positive and negative documents that are matched
by the characteristic set. During this counting, it ignores cases in which two
strings of the characteristic set share an occurrence. From these counts, the
agreement for the characteristic set is directly computable.



procedure Discover(d, Tree_A)
{an integer d ≥ 1 and a suffix tree Tree_A}
begin
    for each path (v_1, ..., v_h) in Tree_A do
        for each (u_1, ..., u_d) from (v_1, ..., v_h) do begin
            {u_j = v_{i_j}, where i_j < i_k if j < k.}
            π_d := (BS(u_1), ..., BS(u_d));
            count the number of positive and negative documents matched
                with π_d;
            if the agreement for π_d is bigger than those ever found
                then keep this value and the current π_d;
        end
    end
end;



Fig. 3. The procedure called from the main algorithm. 
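The control flow of Fig. 2 and Fig. 3 can be sketched in Python as follows. This is a simplified illustration under several assumptions, not the procedure as implemented by the authors: the suffix tree is replaced by a naive quadratic enumeration of branching strings, strings that cross a document boundary are skipped because they cannot occur inside a single document, the root is excluded, and a characteristic set is taken to match a document when all of its strings occur in it (the check on shared occurrences is omitted). The name `efficient_miner_sketch` and the example documents are hypothetical.

```python
from itertools import combinations

def efficient_miner_sketch(docs, labels, d):
    text = "".join(s + "$" for s in docs)        # A := s_1$ ... s_m$
    n = len(text)
    followers = {}
    for i in range(n):
        for j in range(i + 1, n):
            if text[j - 1] == "$":
                break                            # do not cross a document boundary
            # every $ is a distinct letter ($ != $), so tag it with its position
            nxt = ("$", j) if text[j] == "$" else text[j]
            followers.setdefault(text[i:j], set()).add(nxt)
    branching = {x for x, nxt in followers.items() if len(nxt) >= 2}

    def agreement(pi):
        # simplified matching: every string of pi occurs in the document
        return sum(all(x in s for x in pi) == (lab == 1)
                   for s, lab in zip(docs, labels))

    best, best_pi = -1, None
    for i in range(n):                           # one root-to-leaf path per suffix
        path = []
        for j in range(i + 1, n):
            if text[j - 1] == "$":
                break
            if text[i:j] in branching:
                path.append(text[i:j])           # nodes on the path, top-down
        for pi in combinations(path, d):         # i_j < i_k if j < k
            a = agreement(pi)
            if a > best:                         # keep the best one found so far
                best, best_pi = a, pi
    return best_pi, best

docs = ["xabcx", "yabcy", "xaby", "zzz"]
labels = [1, 1, 0, 0]
print(efficient_miner_sketch(docs, labels, 1))
print(efficient_miner_sketch(docs, labels, 2))
```

Because the strings of a candidate are taken from one path in top-down order, each string is a prefix of the next, which is the prefix restriction on characteristic sets.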



4 Correctness and Complexity 

In this section, we show that the algorithm in Fig. 2 works correctly and estimate
its complexity. We begin with a proposition that shows a basic property of the
suffix tree.

Proposition 1. Let x, y, z ∈ Σ* be substrings of a text A such that x = yz, and
u, v be nodes of the suffix tree of A such that BS(u) = x and BS(v) = y. Then u
is a descendant of v.

Example 2. In Fig. 1 of Example 1, bc is a prefix of bcabc$, and v, w are nodes
such that BS(w) = bc and BS(v) = bcabc$. The node v is a descendant of w.

The correctness of the algorithm is shown in the next theorem.

Theorem 1. The algorithm Efficient_Miner correctly solves the Maximum
Agreement Problem.

Proof. First, the procedure Discover enumerates all paths of the suffix tree Tree_A
of A. Then, it selects d nodes (v_{i_1}, ..., v_{i_d}) from each path (v_1, ..., v_h).
For each 1 ≤ k ≤ d, v_{i_k} is a descendant of v_{i_j} for any j < k. Therefore,
BS(v_{i_j}) is a prefix of BS(v_{i_k}) by Proposition 1.

By Lemma 1, it is enough for the algorithm to consider characteristic sets
in canonical form. So, the procedure Discover enumerates only sequences of d
nodes of the suffix tree of A, and it creates characteristic sets from them. This
assures that the procedure Discover outputs a characteristic set in canonical form.

When the procedure Discover counts the number of matched documents, it
ignores cases in which two strings of a characteristic set share an occurrence.
Hence, it does not output a characteristic set whose strings overlap.

Finally, the procedure Discover computes the agreement from the counting
result, and keeps its value when the current characteristic set gives the maximum
agreement among the characteristic sets found so far. Thus, Efficient_Miner solves
the Maximum Agreement Problem correctly.

Next we estimate the time complexity required by Efficient_Miner.

Theorem 2. The procedure Efficient_Miner in Fig. 2 solves the Maximum
Agreement Problem in O(n^2 h^d) time, where n = |A|.

Proof. The procedure Efficient_Miner in Fig. 2 first creates the text A. Then, it
creates the suffix tree of A in O(n) time.

Let T be the time complexity of the procedure Discover in Fig. 3. There
exist n paths in the suffix tree Tree_A of A. For each path of the tree, there are
at most h^d combinations of nodes on the path, where h is the height of the tree.
Thus, there exist O(n h^d) possible characteristic sets.

For each of these characteristic sets, the procedure Discover counts the number
of documents matched by the characteristic set. This is done by the Aho-Corasick
algorithm in O(n + occ(π_d)) time, where occ(π_d) is the number of all occurrences
of strings in π_d. Thus, the following equation

T = n h^d (n + occ(π_d))

holds. It is obvious that occ(π_d) ≤ n for any π_d. Therefore, the time complexity
required by the procedure Discover is T = O(n^2 h^d). Thus, the total time
complexity is O(n + n^2 h^d) = O(n^2 h^d).

5 Conclusion 

We considered characteristic sets as rules for the Maximum Agreement Problem.
This kind of rule is very simple but suitable for large collections of long
semi-structured or unstructured documents, because a string in a characteristic
set may appear at any position not occupied by another string of the set, and
because a string in a characteristic set may contain any primitives, such as tags
in HTML/XML files, together with interesting strings.

If a data format, such as HTML, does not allow users to define their own
macros, then an algorithm that finds common substrings, combined with a table
of the reserved primitives, can discover interesting substrings together with
their primitives, but only for that data format. Our algorithm, in contrast,
works for any format. Moreover, for a data format that allows users to define
their own macros or tags, such as LaTeX and XML, the above method does not work
well because it is difficult to construct the table of user-defined macros.

We developed a simple and efficient algorithm that finds a characteristic
set maximizing the agreement. It runs in O(n^2 h^d) time in the worst case,
where n is the total length of the input documents, h is the height of the suffix
tree of the documents, and d is a constant that determines the number of strings
in a characteristic set. However, this algorithm requires that the input documents
have been partitioned into two sets. One solution is to use a keyword search
before applying this algorithm: documents matched by the search are treated as
positive and the other documents as negative.

If input documents are uniformly random, it is known that the height h is
small compared to n [2]. Evaluating the height for real data, such as HTML
files, by experiments is future work.

We defined a characteristic set π = (x_1, ..., x_d) such that each x_i is a prefix
of x_{i+1}. It is easy to modify this so that x_i is a suffix of x_{i+1}. However,
extending it so that x_i is any substring of x_{i+1} is an interesting open problem.
More generally, finding better rules or patterns for the Maximum Agreement
Problem is also important future work.

Abstract. We consider the problem of finding two-dimensional association
rules for categorical attributes. Suppose we have two conditional
attributes A and B, both of whose domains are categorical, and one binary
target attribute whose domain is {"positive", "negative"}. We want
to split the Cartesian product of the domains of A and B into two subsets
so that a certain objective function is optimized, i.e., we want to find
a good segmentation of the domains of A and B. We consider in this
paper the objective function that maximizes the confidence under the
constraint of an upper bound on the support size. We first prove that
the problem is NP-hard, and then propose an approximation algorithm
based on semidefinite programming. In order to evaluate the effectiveness
and efficiency of the proposed algorithm, we carry out computational
experiments for problem instances generated from real sales data consisting
of attributes whose domain size is a few hundred at most. Approximation
ratios of the solutions obtained, measured against the solutions of the
semidefinite programming relaxation, range from 76% to 95%. It is
observed that the performance of the generated association rules is
significantly superior to that of one-dimensional rules.



1 Introduction 



In recent years, data mining has made it possible to discover valuable rules
by analyzing huge databases. Efficient algorithms for finding association rules
have been proposed, and classification and regression trees that use these rules
as branching tests have been extensively studied.

One important application field of data mining is marketing. In particular,
we are interested in developing an effective strategy of direct mail distribution
based on customer purchase data accumulated in databases. Our research stems
from the following real problem related to direct mail distribution. Kao, a leading
manufacturer of household products in Japan, developed a new brand, Lavenus, in
1996, which covers several different categories of products, including shampoos
and conditioners, hair care products, and hair dye. For the first year, however,
it was not well recognized by customers. Kao then decided to start a joint
sales promotion with Pharma, which is a drugstore chain in Japan that has
approximately three million members (Pharma has been maintaining detailed
customer purchase data for more than ten years). The sales promotion was done
by sending direct mail to potential users of Lavenus. In order to establish an
effective direct mail plan, it is crucial to select customers who have high
potential to buy Lavenus in the near future.

For this purpose, Pharma investigated the purchase data of Lavenus users, i.e.,
customers who belong to Pharma's member club and had already bought Lavenus,
and identified commodities that sell well among them. Under the hypothesis
that those who frequently buy these commodities but have not yet purchased
Lavenus are likely to become Lavenus users in the near future, Kao sent
direct mail with free sample coupons to such customers. Kao and Pharma
observed the effectiveness of this sales promotion. However, this result raises
the following question: What is the best strategy to find customers to be targeted
for direct mail promotion? This question motivates our study.

In view of this motivation, hoping that more refined rules may find
customers that have a higher response rate to direct mail, we study the problem of
finding two-dimensional association rules for categorical attributes. Association
rules that classify a target attribute will provide us with valuable information,
which may in turn help us understand relationships between conditional and target
attributes. Association rules are used to derive good decision trees. As pointed
out by Kearns and Mansour, the recent popularity of decision trees such
as C4.5 by Quinlan is due to their simplicity and efficiency, and one of the
advantages of using decision trees is potential interpretability to humans.

One-dimensional association rules for categorical attributes can be efficiently
obtained. On the other hand, as will be shown in this paper, finding
two-dimensional association rules for categorical attributes is NP-hard.
Nevertheless, we shall develop a practically efficient approximation algorithm
for obtaining two-dimensional association rules for categorical attributes. One
of the advantages of two-dimensional association rules over one-dimensional ones
is that two-dimensional rules usually induce a decision tree of smaller size that
has a higher classification ability.

We assume that a database consists of only categorical attributes. Let R be
a database relation. We treat one special attribute as a target attribute. Other
attributes are called conditional attributes. We assume in this paper that the
domain of a target attribute is {0, 1}, i.e., 1 means "positive" response and 0
"negative". Among conditional attributes, we focus on two particular attributes
A and B. Let dom(A) and dom(B) denote the domains of A and B, respectively.
Let n_A = |dom(A)| and n_B = |dom(B)|, and let U ⊆ dom(A), V ⊆ dom(B). For
notational convenience, let S = U × V and S̄ = dom(A) × dom(B) − (U × V),
where dom(A) × dom(B) denotes the Cartesian product of dom(A) and dom(B).
We then split dom(A) × dom(B) into (S, S̄). Ideally, we want to find S for which
all records t ∈ R with (t[A], t[B]) ∈ S take value '1' in a target attribute, while
other records t ∈ R with (t[A], t[B]) ∈ S̄ take value '0'. Since such a segmentation
is impossible in general, we introduce a certain objective function f(S, S̄) that
evaluates the goodness of the segmentation.

We consider in this paper the problem of maximizing the confidence under
the constraint that the support does not exceed a given threshold. We transform
the problem into one of finding a dense subgraph in an appropriately defined
weighted bipartite graph. We first prove its NP-hardness by reduction from the
balanced complete bipartite subgraph problem, which is known to be NP-complete.
We shall then focus on an approximation algorithm. We propose
in this paper an approximation algorithm based on a semidefinite programming
(SDP) relaxation. The idea of the relaxation is similar to the one by Srivastav and
Wolf for the densest subgraph problem on general graphs. The densest subgraph
problem has recently been studied by several researchers. For the case
where weights satisfy the triangle inequality, Arora et al. proposed a PTAS
(polynomial-time approximation scheme) for k = Ω(n), where k is a problem
parameter that constrains the lower bound on the vertex size of the subgraphs we
seek. For the general case, only a weaker approximation ratio is known for
general k. For k = Θ(n), a few papers presented algorithms with constant
approximation ratios. After a solution for the SDP relaxation is obtained, the

conventional approach obtains a rounded solution based on random hyperplane
cutting. On the other hand, we introduce a refined rounding technique that makes
use of the special structure of bipartite graphs. Although we have not yet
obtained an improved approximation ratio for our problem, we have implemented
the proposed algorithm and carried out computational experiments to assess its
effectiveness and efficiency. In our experiments, we have employed SDPA, a
software package for semidefinite programming problems developed by one of the
authors. Although SDP relaxation is known to be powerful in approximately
solving densest subgraph problems, there seems to be no report on computational
experiments, as far as the authors know. Thus, this paper seems to be the
first to report computational results for SDP-based approximation algorithms
for such problems, although our algorithm is limited to bipartite graphs.

Problem instances we have solved have sizes (i.e., n_A × n_B) ranging from
1,600 to 100,000, all of which are obtained through Pharma from real sales data
related to the Lavenus sales promotion. We observe that the proposed algorithm
efficiently produces good approximate solutions in general for both small and
large problem instances. In fact, the average of the ratios of the objective value
of the obtained solutions to that of the SDP solutions exceeds 85%. We also observe
that the obtained two-dimensional association rules outperform one-dimensional
rules in solution quality. Therefore, we believe that the proposed approach will
be effective for various data mining applications that deal with categorical
attributes.

Approximation of Optimal Two-Dimensional Association Rules

2 Problem Formulation 

For simplicity, we assume dom(A) = {1, 2, ..., n_A} and dom(B) = {1, 2, ..., n_B}.
Let n = n_A + n_B. Let s_ij denote the number of records t such that t[A] = i and
t[B] = j. Among such s_ij records, let h_ij denote the number of records such that
t[C] = 1 (i.e., the value of a target attribute C is positive), and let
h̄_ij = s_ij − h_ij. For R ⊆ dom(A) × dom(B), let

\[ s(R) = \sum_{(i,j) \in R} s_{ij}, \qquad h(R) = \sum_{(i,j) \in R} h_{ij}. \]

Suppose we are given a two-dimensional association rule r defined over
conditional attributes A and B and a target attribute C. Let S denote the set of
(i, j) ∈ dom(A) × dom(B) such that a tuple t with t[A] = i, t[B] = j satisfies
the condition of rule r. Then the rule r can be identified with S. The number
s(S) of tuples satisfying the condition of r is called the support of S.

The number h(S) of tuples t that satisfy the condition of r as well as the
target condition (i.e., t[C] = 1) is called the hit of r. The ratio h(S)/s(S) is
called the confidence of S, or conf(S).

We are interested in a rule r such that the corresponding subset S of dom(A) ×
dom(B) takes the form S = U × V for some U ⊆ dom(A) and V ⊆ dom(B).

The problem we consider in this paper is to maximize h(S) under the
constraint s(S) ≤ M, and is formulated as follows.

\[
\begin{array}{llr}
P:\ \text{maximize}  & h(S) & (1) \\
\text{subject to}    & S = U \times V,\ U \subseteq \mathrm{dom}(A),\ V \subseteq \mathrm{dom}(B), & (2) \\
                     & s(S) \le M, & (3)
\end{array}
\]

where M is a given positive constant. The practical role of the constraint (3) in
terms of our application of direct mail distribution is to control the number of
customers targeted for direct mail.

Lemma 1. Problem P is NP-complete.

Proof. The reduction is from the balanced complete bipartite subgraph problem,
which is known to be NP-complete. Given a bipartite graph G = (U, V, E)
and a positive integer m, it asks whether G contains a subgraph K_{m,m}.
Letting m be a prime integer satisfying m^2 > n, we construct an instance
of problem P as follows. Letting dom(A) = U and dom(B) = V, define h_ij = 1
if e = (i, j) ∈ E for i ∈ U, j ∈ V and h_ij = 0 otherwise. In addition, define
s_ij = 1 for all i, j with i ∈ U, j ∈ V. We set M = m^2. Then, it is easy to see
that the instance has a solution of objective value M = m^2 (i.e., confidence of
one) if and only if the bipartite graph has a subgraph K_{m,m}.

From this lemma, we then focus on an approximation algorithm in the next
section.
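To make the quantities of this section concrete, the following Python sketch computes the support and hit of a rectangle S = U × V and solves problem P by exhaustive search on a tiny made-up instance. It is only an illustration under stated assumptions: indices are 0-based rather than 1-based, the function names are hypothetical, and the search is exponential in n_A + n_B, so it is usable only for very small domains.

```python
from itertools import chain, combinations

def support(s, U, V):
    """s(U x V): number of records falling into the rectangle U x V."""
    return sum(s[i][j] for i in U for j in V)

def hit(h, U, V):
    """h(U x V): number of positive records falling into U x V."""
    return sum(h[i][j] for i in U for j in V)

def subsets(items):
    items = list(items)
    return chain.from_iterable(combinations(items, r)
                               for r in range(len(items) + 1))

def solve_P_bruteforce(s, h, M):
    """Maximize h(U x V) subject to s(U x V) <= M by trying every U and V."""
    n_a, n_b = len(s), len(s[0])
    best = (0, (), ())
    for U in subsets(range(n_a)):
        for V in subsets(range(n_b)):
            if support(s, U, V) <= M and hit(h, U, V) > best[0]:
                best = (hit(h, U, V), U, V)
    return best

# Made-up counts: s[i][j] records with t[A] = i, t[B] = j; h[i][j] of them positive.
s = [[4, 2, 5], [3, 6, 1]]
h = [[3, 0, 4], [1, 1, 0]]
value, U, V = solve_P_bruteforce(s, h, M=10)
print(value, U, V, value / support(s, U, V))  # hit, rectangle, confidence
```

On this instance the best rectangle is U = {0}, V = {0, 2} with hit 7 and support 9.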
3 Approximation Algorithm

The problem of finding a densest subgraph in general graphs has been studied by
several authors as mentioned in Section 1, and approximation algorithms have
been proposed. Up to now, neither constant- nor logarithmic-approximation
algorithms have been proposed. Since we focus our attention on bipartite
graphs, it may be possible to develop better approximation algorithms. Although
we have not yet succeeded in such an attempt, we formulate the problem P
as a semidefinite programming problem in a standard way, and we
shall see its approximation ability through computational experiments.

In the formulation as SDP, we introduce an integer variable x_i for each i ∈
dom(A) = {1, 2, ..., n_A} and a variable y_j for each j ∈ dom(B) = {1, 2, ..., n_B}.
We interpret x_i as x_i = 1 if i ∈ U and x_i = −1 otherwise. Similarly, y_j is
interpreted as y_j = 1 if j ∈ V and y_j = −1 otherwise.

Then the problem P can be rewritten as follows:

\[
\begin{array}{llr}
P:\ \text{maximize} & \frac{1}{4} \sum_{i=1}^{n_A} \sum_{j=1}^{n_B} h_{ij} (x_0 + x_i)(x_0 + y_j) & \\
\text{subject to}   & \frac{1}{4} \sum_{i=1}^{n_A} \sum_{j=1}^{n_B} s_{ij} (x_0 + x_i)(x_0 + y_j) \le M, & (4) \\
                    & x_0 = 1. &
\end{array}
\]
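The ±1 encoding works because, with x_0 = 1, the factor (x_0 + x_i)(x_0 + y_j)/4 equals 1 when x_i = y_j = 1 and 0 otherwise. The following Python sketch checks, on made-up hit counts and with 0-based indices, that the rewritten objective coincides with h(U × V) for every assignment; the function names are hypothetical.

```python
from itertools import product

def hit_direct(h, U, V):
    """h(U x V): total hits over the rectangle U x V (0-based indices)."""
    return sum(h[i][j] for i in U for j in V)

def hit_pm1(h, x, y, x0=1):
    """Objective of the rewritten problem P with x_i, y_j in {-1, +1}, x_0 = 1."""
    return sum(h[i][j] * (x0 + x[i]) * (x0 + y[j])
               for i in range(len(x)) for j in range(len(y))) / 4

h = [[3, 0, 4], [1, 1, 0]]              # made-up hit counts, n_A = 2, n_B = 3
for x in product((-1, 1), repeat=2):
    for y in product((-1, 1), repeat=3):
        U = [i for i, v in enumerate(x) if v == 1]
        V = [j for j, v in enumerate(y) if v == 1]
        assert hit_pm1(h, x, y) == hit_direct(h, U, V)

print(hit_pm1(h, (1, -1), (1, -1, 1)))  # U = {0}, V = {0, 2}
```

The same identity applies to the constraint (4) with s_ij in place of h_ij.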

For ease of exposition, letting x = (x_0, x_1, ..., x_{n_A}) and y = (y_1, y_2, ...,
y_{n_B}), we introduce an (n + 1)-dimensional vector z = (x, y) such that z_i = x_i
for i with 0 ≤ i ≤ n_A and z_i = y_{i−n_A} for i with n_A + 1 ≤ i ≤ n_A + n_B (= n).
In addition, we define

\[
\tilde{h}_{ij} =
\begin{cases}
h_{i,\,j-n_A}/2 & \text{for } 1 \le i \le n_A \text{ and } n_A + 1 \le j \le n_A + n_B, \\
h_{j,\,i-n_A}/2 & \text{for } n_A + 1 \le i \le n_A + n_B \text{ and } 1 \le j \le n_A, \\
0 & \text{otherwise,}
\end{cases}
\]

and we define s̃_ij from s_ij in the same way. We then have the following
formulation equivalent to (4).

\[
\begin{array}{ll}
P:\ \text{maximize} & \frac{1}{4} \sum_{i=1}^{n} \sum_{j=1}^{n} \tilde{h}_{ij} (z_0 + z_i)(z_0 + z_j) \\
\text{subject to}   & \frac{1}{4} \sum_{i=1}^{n} \sum_{j=1}^{n} \tilde{s}_{ij} (z_0 + z_i)(z_0 + z_j) \le M, \\
                    & z_0 = 1.
\end{array}
\]

In the same manner as is taken in existing semidefinite programming
relaxations, we relax the integrality constraint and allow the variables to be
vectors in the unit sphere in R^{n+1}. Letting B_1 be the unit sphere in R^{n+1},
the relaxation problem can be written as follows:

\[
\begin{array}{llr}
SDP_1:\ \text{maximize} & \frac{1}{4} \sum_{i=1}^{n} \sum_{j=1}^{n} \tilde{h}_{ij} (z_0 + z_i) \cdot (z_0 + z_j) & \\
\text{subject to}       & \frac{1}{4} \sum_{i=1}^{n} \sum_{j=1}^{n} \tilde{s}_{ij} (z_0 + z_i) \cdot (z_0 + z_j) \le M, & (5) \\
                        & z_0 = (1, 0, 0, \ldots, 0). &
\end{array}
\]

Introducing the variable v_ij with v_ij = z_i · z_j, the above problem can be
rewritten as follows:

\[
\begin{array}{llr}
SDP_2:\ \text{maximize} & \frac{1}{4} \sum_{i=1}^{n} \sum_{j=1}^{n} \tilde{h}_{ij} (1 + v_{0i} + v_{0j} + v_{ij}) & \\
\text{subject to}       & \frac{1}{4} \sum_{i=1}^{n} \sum_{j=1}^{n} \tilde{s}_{ij} (1 + v_{0i} + v_{0j} + v_{ij}) \le M, & (6) \\
                        & v_{ii} = 1 \text{ for } i = 0, 1, \ldots, n, & \\
                        & Y = \{v_{ij}\}: \text{symmetric and positive semidefinite.} &
\end{array}
\]

This problem can be solved within an additive error δ of the optimum in time
polynomial in the size of the input and log(1/δ) by interior point algorithms.
After obtaining an optimal solution v*_ij for SDP_2, we can obtain an optimal
solution z*_i of SDP_1 by a Cholesky decomposition. We then round each unit
vector z*_i to +1 or −1. In the conventional method, this is done by choosing
a random unit vector u on B_1 and by rounding z*_i to 1 or −1 depending on
the sign of the inner product of u and z*_i. Let z denote the rounded solution so
obtained.
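The conventional random hyperplane rounding can be sketched in Python as follows. The sketch starts from already computed unit vectors (solving the SDP and the Cholesky factorization are outside its scope), draws the random direction from independent Gaussians, and flips all signs when the vector for z_0 would be rounded to −1 so that z_0 = 1 holds. That flip is an assumption of this sketch; the text does not say how z_0 is treated. The function name and the input vectors are made up.

```python
import random

def hyperplane_round(vectors, rng):
    """Round unit vectors z_0, ..., z_n to +1/-1 by the sign of their inner
    product with a random direction u; vectors[0] plays the role of z_0."""
    dim = len(vectors[0])
    # Independent Gaussians give the direction of a uniformly random unit vector;
    # normalizing u is unnecessary because only signs are used.
    u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    signs = [1 if sum(a * b for a, b in zip(u, z)) >= 0 else -1 for z in vectors]
    if signs[0] == -1:           # assumption: flip so that z_0 is rounded to +1
        signs = [-t for t in signs]
    return signs

vectors = [
    (1.0, 0.0, 0.0),    # z_0
    (1.0, 0.0, 0.0),    # parallel to z_0: always rounded to +1
    (-1.0, 0.0, 0.0),   # opposite to z_0: always rounded to -1
    (0.0, 1.0, 0.0),    # orthogonal to z_0: +1 or -1 with equal probability
]
print(hyperplane_round(vectors, random.Random(0)))
```

Vectors close to z_0 are likely to be rounded to +1, which is what makes the rounded solution resemble the relaxed one.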

Among the several software packages for SDPs that are currently available, we
use SDPA (Semi-Definite Programming Algorithm) in our implementation. SDPA is a
C++ implementation of a Mehrotra-type primal-dual predictor-corrector
interior-point method [15, 18] for solving the standard form of SDP. SDPA
incorporates data structures for handling sparse matrices and an efficient method
proposed by Fujisawa et al. for computing search directions for problems
with large sparse matrices.

In order to obtain reasonably good solutions from SDP solutions, we imple- 
ment the following two algorithms. 

Algorithm 1: Instead of using the conventional randomized rounding scheme,
the algorithm uses a refined way to round an SDP solution. We do not choose a
random unit vector, but instead we try all vectors z*_k with 1 ≤ k ≤ n_A. In fact,
from our experimental results, it is worthwhile to try many different vectors in
order to obtain better approximate solutions for P. Namely, for a fixed z*_k with
1 ≤ k ≤ n_A (we call it a basis vector), we round z*_i to 1 or −1 for other i ≠ k
with 1 ≤ i ≤ n_A, depending on the sign of the inner product of z*_k and z*_i. We
round z*_k to 1. Let x^k_i for i with 1 ≤ i ≤ n_A be the rounded solution so obtained.

For each rounded solution {x^k_i | 1 ≤ i ≤ n_A} with 1 ≤ k ≤ n_A, we obtain a
rounded solution {y^k_j | 1 ≤ j ≤ n_B} from z*_j with n_A + 1 ≤ j ≤ n based on the
following lemma.

Lemma 2. For problem P, when a rounded solution {x^k_i | 1 ≤ i ≤ n_A} with
k = 1, 2, ..., n_A is given, the optimal assignment of {y^k_j | 1 ≤ j ≤ n_B} can be
obtained by solving a knapsack problem with a single constraint.

Proof. For problem P, introducing a new variable y'_j = (1 + y_j)/2, which takes
0 or 1, problem P becomes an ordinary knapsack problem.
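The knapsack problem of Lemma 2 can be written out as follows: once U is fixed, each j ∈ dom(B) becomes an item with weight Σ_{i∈U} s_ij and value Σ_{i∈U} h_ij, and the capacity is M. The Python sketch below solves it with the textbook dynamic program for integer weights. This is an exact method swapped in for illustration, not the heuristic used in the experiments; indices are 0-based and the function name and data are made up.

```python
def best_V_for_fixed_U(s, h, U, M):
    """Exact 0/1 knapsack DP: given U, choose V maximizing h(U x V)
    subject to s(U x V) <= M. Assumes integer counts and integer M."""
    n_b = len(s[0])
    weight = [sum(s[i][j] for i in U) for j in range(n_b)]  # column support in U
    value = [sum(h[i][j] for i in U) for j in range(n_b)]   # column hits in U
    # best[c] = (max value, chosen columns) with total weight at most c
    best = [(0, ())] * (M + 1)
    for j in range(n_b):
        new = best[:]                      # each column is used at most once
        for c in range(weight[j], M + 1):
            prev_value, prev_cols = best[c - weight[j]]
            if prev_value + value[j] > new[c][0]:
                new[c] = (prev_value + value[j], prev_cols + (j,))
        best = new
    return best[M]

s = [[4, 2, 5], [3, 6, 1]]
h = [[3, 0, 4], [1, 1, 0]]
print(best_V_for_fixed_U(s, h, U=(0,), M=10))   # value and chosen V
```

For U = {0} and M = 10, the items have weights 4, 2, 5 and values 3, 0, 4, so the best choice is V = {0, 2} with value 7.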

Based on this lemma, in order to solve P, we solve n_A distinct knapsack problems
because we try n_A distinct basis vectors. Thus, we obtain n_A feasible solutions
of P. In addition, by exchanging the roles of x and y, we do the above task in
the same way to get n_B feasible solutions as well. Thus, we obtain n feasible
solutions of P in total. The algorithm outputs the one that maximizes the
objective function of P. In our implementation, we do not exactly solve the
knapsack problems. Instead we use the heuristic developed by Kubo and Fujisawa,
who have shown that the heuristic is fast and produces very good solutions.

Algorithm 2: In order to obtain a feasible solution from an SDP solution z*,
we adopt the conventional rounding method explained above. The solution z so
obtained may violate the constraint (4), or it may leave room for improvement
((4) is satisfied and some z_i can be changed from −1 to 1 to increase the
objective value without violating (4)).

(1) If z is feasible, we check whether changing z_i from −1 to +1 still preserves
the constraint (4). Let I be the set of such i. For each i ∈ I, we consider the
increase of the objective function of P and the increase of the left-hand side of
the constraint (4) caused by changing z_i from −1 to +1. The algorithm chooses
the most favorable i* ∈ I according to these two quantities and sets z_{i*} = 1.
Deleting i* from I, we repeat this process as long as the current solution is
feasible. This is a greedy-type algorithm.

The solution obtained in this manner is then further improved in a manner
similar to Algorithm 1. Namely, for a given solution z, we fix its x-part (resp.
y-part) while we treat the y-part (resp. x-part) as free variables in order to
maximize the objective function of P. This problem is again a knapsack problem,
and is also solved by the heuristic developed by Kubo and Fujisawa.

(2) If z is infeasible, we change some z_i from +1 to −1. This change is
also made in a greedy manner as in (1). We also improve the obtained solution
by applying the heuristic of Kubo and Fujisawa after formulating the problem as
a knapsack problem.

In the implementation of Algorithm 2, we generate random unit vectors u on
B_1 as many times as in Algorithm 1. Among the solutions obtained above, we
choose the best one.



4 Computational Experiments 

In order to see the effectiveness and efficiency of the proposed two algorithms, 
we have performed computational experiments. Problem instances are generated 
from sales data related to Lavenus sales promotion mentioned in Section 1. 
We have chosen in our experiments six stores different from those for which 
Pharma performed experiments, and used sales data for three months starting 
from March of 1997. We have focused on customers who belong to Pharma's
member club and visited those stores at least three times during those three 
months. We concentrate on the same 660 brands correlated to Lavenus products 
that Pharma identified. Those brands are classified into seven classes as follows: 



Table 1. Seven categories of commodities

class #   description                            # of brands
   1      medical and pharmaceutical products        187
   2      medical appliances and equipment            56
   3      health foods                                38
   4      miscellaneous daily goods                  252
   5      cosmetics                                   76
   6      baby articles                               43
   7      others                                       8





We consider that these seven classes are different attributes. The domain of 
an attribute is a set of brands that fall into the corresponding class. Let Sij stand 
for the number of customers who bought brands i and j. Among such customers 
let hij denote the number of customers who also bought Lavenus products. We 
generate problem instances by choosing two distinct classes. We also generate 
problem instances of larger size by grouping seven classes into two disjoint sets. 
In this case, each set is regarded as a single attribute. The problem instances we 
generated for computational experiments are listed in Table 2. For each problem 
instance we have tested two or three different values of M. 

Table 2. Problem instances generated for numerical experiments

problem #      (1)  (2)  (3)  (4)  (5)  (6)    (7)  (8)        (9)
classes of A   2    3    5    2    5    5      1    1,2        1,5,6
classes of B   1    4    1    4    4    1,2,3  4    3,4,5,6,7  2,3,4,7

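In order to compare the performance of two-dimensional rules generated by
our algorithms with one-dimensional ones, we have considered the following
problems Q_A and Q_B. Let

\[ s_{i\cdot} = \sum_{j=1}^{n_B} s_{ij}, \qquad s_{\cdot j} = \sum_{i=1}^{n_A} s_{ij}. \]

h_{i·} and h_{·j} are similarly defined.

\[ Q_A:\ \text{maximize} \Big\{ \sum_{i \in U} h_{i\cdot} \ \Big|\ U \subseteq \mathrm{dom}(A),\ \sum_{i \in U} s_{i\cdot} \le M \Big\}. \qquad (7) \]

\[ Q_B:\ \text{maximize} \Big\{ \sum_{j \in V} h_{\cdot j} \ \Big|\ V \subseteq \mathrm{dom}(B),\ \sum_{j \in V} s_{\cdot j} \le M \Big\}. \qquad (8) \]

Both problems are knapsack problems and are solved by a knapsack algorithm
from the literature. We then choose the better one of the two solutions obtained.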

Computational results for Problem P are shown in Table 3. Experiments
have been carried out on a DEC Alpha 21164 (600MHz).

The first column indicates the problem number and the problem size (n_A × n_B).
The second column indicates the overall average of the confidence, i.e.,
Σh_ij / Σs_ij, where the sum is taken over all pairs of brands (i, j) that belong
to dom(A) × dom(B). The third column shows the ratio of M to Σs_ij, where
the sum is taken over all pairs of brands that belong to dom(A) × dom(B). The
fourth column represents the ratio of the objective value for an approximate
solution to the upper bound of the optimal objective value obtained by SDP.
Here the approximate solution indicates the better one between those obtained
by Algorithms 1 and 2. The fifth column indicates the CPU time spent by the SDPA
software. The sixth, the seventh and the eighth columns indicate the confidence
of two-dimensional rules derived by Algorithms 1 and 2, and of the one-dimensional
rule, respectively. The asterisk * indicates that it exhibits the better performance

Table 3. Computational results for problem instances tested

problem # &    ave. confidence        M     approx.  time     confidence  confidence  confidence
size n_A×n_B   (Σh_ij / Σs_ij)              ratio    (sec.)   (2-D rule)  (2-D rule)  (1-D rule)
                                                              Algo. 1     Algo. 2
(1) 56×187     6.3% (132/2085)        1/8   88%       27.5    *25.3%       23.0%      19.9%
                                      1/4   89%       25.7     15.5%      *16.5%      15.5%
(2) 38×252     7.4% (93/1256)         1/8   82%       47.7     36.9%      *38.2%      30.0%
                                      1/4   91%       45.0     23.9%      *25.5%      21.3%
(3) 76×187     7.8% (154/1976)        1/8   83%       39.7     26.3%      *30.0%      24.7%
                                      1/4   91%       36.2     21.7%      *21.9%      19.4%
(4) 56×252     9.3% (477/5149)        1/8   89%       63.0    *28.7%      *28.7%      23.0%
                                      1/4   95%       54.0     20.7%      *20.9%      17.6%
(5) 76×252     7.9% (491/6194)        1/8   85%       76.9     22.5%      *22.9%      19.4%
                                      1/4   90%       72.7     16.0%      *17.0%      15.1%
(6) 76×281     7.3% (187/2578)        1/8   84%      101.1     27.3%      *29.5%      23.9%
                                      1/4   91%       87.1     18.8%      *20.8%      18.8%
(7) 187×252    6.9% (1476/21513)      1/8   90%      211.9     22.3%      *22.5%      20.4%
                                      1/4   91%      184.2     14.4%      *20.8%      14.4%
(8) 243×417    6.9% (2409/35133)      1/16  77%      739.7    *29.3%       25.6%      24.6%
                                      1/8   86%      848.3     21.0%      *21.2%      19.8%
                                      1/4   94%      847.3     15.1%      *16.2%      15.1%
(9) 306×354    6.4% (2545/39470)      1/16  76%      814.8    *27.0%       25.1%      24.2%


1/8 


87% 


851.7 


*20.0% 


*20.0% 


18.7% 


1/4 


95% 


757.8 


14.4% 


*15.2% 


14.4% 



than the one without *. We see from the table that the proposed algorithms 
produce good approximate solutions in a reasonable amount of time. The time 
spent for obtaining one rounded solution from SDP solution by both Algorithms 
1 and 2 ranges from 1 to 13 seconds depending the problem size (most of the time 
is spent for solving knapsack problems). Since we obtain n different candidate 
solutions for both Algorithms 1 and 2, the time required for obtaining the best 
approximate solution from an SDP solution is, on the average, three times larger 
than that for obtaining an SDP solution. Notice that for the cases of M/ ^ Sy = 
1/8, we see a significant difference between the confidence of two-dimensional 
association rules and that of one-dimensional rules. 

As for the comparison of Algorithms 1 and 2, Algorithm 2 produces better
solutions in many cases, while the difference in performance between
Algorithms 1 and 2 becomes smaller as $M/\sum s_{ij}$ becomes smaller. In
particular, for problems 8 and 9 with $M/\sum s_{ij} = 1/16$, Algorithm 1
exhibits better performance.

Notice that for two attributes A and B in our problem instance, $s(R)$ for
$R = U \times V$ with $U \subseteq dom(A)$ and $V \subseteq dom(B)$ does not
represent, in general, the number of customers who bought at least one brand
in U and at least one brand in V, because a single customer may have bought
many pairs of brands $(i, j) \in R$, and such a customer is counted that many
times in $s(R)$. Therefore, in order to evaluate the practical effectiveness
of the rules generated, we need to calculate the number of customers that
satisfy the condition of the association rule obtained for each problem
instance, and the number of those customers that also satisfy the target
condition (i.e., bought Lavenus products). We call the former number the
c-support, and the latter the c-hit ("c-" is attached in order to distinguish
them from the support and the hit defined in Section 2). We call the ratio of
c-hit to c-support the hit ratio of the rule.
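
The three customer-level measures can be sketched as follows. This is an
illustrative sketch only: it assumes that a customer satisfies the rule
$R = U \times V$ when the customer bought at least one brand in U and at least
one brand in V, and the function and argument names are hypothetical.

```python
def evaluate_rule(purchases, target_buyers, brands_u, brands_v):
    """Return (c-support, c-hit, hit ratio) of the rule U x V.

    purchases:     dict mapping a customer id to the set of brands bought
    target_buyers: set of customer ids that satisfy the target condition
    brands_u/v:    the brand sets U and V that define the rule
    """
    # Each customer is counted once, however many pairs (i, j) were bought.
    matched = {c for c, bought in purchases.items()
               if bought & brands_u and bought & brands_v}
    c_support = len(matched)
    c_hit = len(matched & target_buyers)
    hit_ratio = c_hit / c_support if c_support else 0.0
    return c_support, c_hit, hit_ratio
```

Unlike $s(R)$, the set comprehension counts customers rather than purchased
pairs, which is the distinction the "c-" prefix is meant to capture.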

We have computed the c-supports, c-hits and hit ratios of the four
two-dimensional rules obtained by our algorithm for the last two problem
instances, i.e., the rules found in the second-to-last problem instance for
$M/\sum s_{ij} = 1/8$ and 1/4 (denoted by Rules 1 and 2, respectively), and
the rules found in the last problem instance for $M/\sum s_{ij} = 1/8$ and 1/4
(denoted by Rules 3 and 4, respectively). Results are summarized in Table 4.
In the second-to-last and last problem instances, the numbers of customers
that bought at least one brand from A and at least one brand from B are 4822
and 5190, respectively, and the numbers of Lavenus users among them are 184
(3.8%) and 198 (3.8%). Therefore, as seen from Table 4, our algorithm found a
good customer segmentation.



Table 4. Comparison of two-dimensional rules and conventional rules in terms
of c-support, c-hit and hit ratio

                                                  at-least-k rules
rules       Rule 1   Rule 2   Rule 3   Rule 4    k=3     k=4     k=5     k=6     k=7     k=8
c-support   916      1602     1126     1921      5806    4009    2775    1961    1375    959
c-hit       83       112      95       114       253     214     169     137     103     75
hit ratio   9.1%     7.0%     8.4%     5.9%      4.4%    5.3%    6.1%    7.0%    7.5%    7.8%


As mentioned in Section 1, Kao and Pharma, in their experiments, selected the
customers for direct mail distribution by the rule that a customer bought at
least three distinct brands, within a certain time period, from among the 660
brands that Pharma identified. Let us call this rule the at-least-3 rule. We
can generalize it to the at-least-k rule. For comparison, we consider k with
$3 \le k \le 8$. We have computed the c-supports, c-hits and hit ratios for
such rules. The results are summarized in Table 4. We see from the table that
Rules 2 and 4 are a bit worse than the at-least-6 or at-least-7 rule, while
Rules 1 and 3 are better than the at-least-8 rule.
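
The at-least-k baseline can be evaluated with the same customer-level
measures. The sketch below is a hedged illustration; `brand_pool` stands for
the set of brands from which distinct purchases are counted, and all names are
hypothetical.

```python
def at_least_k_table(purchases, target_buyers, brand_pool, ks):
    """Return {k: (c-support, c-hit, hit ratio)} for each at-least-k rule.

    purchases:     dict mapping a customer id to the set of brands bought
    target_buyers: set of customer ids that satisfy the target condition
    brand_pool:    set of brands that count towards the k distinct brands
    ks:            iterable of thresholds k
    """
    table = {}
    for k in ks:
        # A customer is selected when at least k distinct pool brands were bought.
        selected = {c for c, bought in purchases.items()
                    if len(bought & brand_pool) >= k}
        c_hit = len(selected & target_buyers)
        ratio = c_hit / len(selected) if selected else 0.0
        table[k] = (len(selected), c_hit, ratio)
    return table
```

Raising k shrinks the selected set, so the c-support decreases monotonically
in k, which matches the pattern of the at-least-k columns in the table.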

5 Conclusion 

We have proposed an approximation algorithm based on SDP relaxation for
finding optimal two-dimensional association rules for categorical attributes.
From computational experiments, it was observed that the proposed algorithm
finds a good segmentation in a reasonable amount of time. Finally, we observed
how effective the obtained rules are in terms of customer segmentation by
applying the algorithm to real sales data.

There are still several tasks remaining for future research.

(1) First, we want to improve the approximation ratio for problem P, i.e., the
densest subgraph problem for bipartite graphs. The current best ratio is the
same as the one obtained for general graphs.

(2) The proposed algorithm can be extended to the problems with entropy gain
or interclass maximization in a straightforward manner. However, this requires
more computation time, so we need to further improve the practical efficiency
of the algorithm.

(3) We would like to conduct an experiment to see the robustness of the obtained
two-dimensional association rules in terms of the ability of future forecasting.

(4) From the viewpoint of sales promotion through direct mail distribution, we
would like to carry out experiments to see the difference between the response
rates of customers selected by the rules obtained by our algorithm and of
those selected by other rules.

Abstract. This paper proposes an efficient method for data mining
of generalized association rules on the basis of partial-match retrieval.
A generalized association rule is derived from regularities of data
patterns, which are found in the database under a given data hierarchy with
sufficient frequency. The pattern search is a central part of data mining
of this type and occupies most of the running time. In this paper, we
regard a data pattern as a partial-match query in partial-match retrieval;
the pattern search then becomes the problem of finding partial-match queries
whose answers include a sufficient number of database records. The
proposed method consists of a selective enumeration of candidate queries and
an efficient partial-match retrieval using signatures. A signature, which
is a bit sequence of fixed length, is associated with each datum, record and
query. The answer to a query is computed quickly by bit operations among
the signatures. The proposed data mining method is realized based on
an extended signature method that can deal with a data hierarchy. We
also discuss design issues and mathematical properties of the method.



1 Introduction 



Data mining of association rules [AIS93, AS94, MTV94] is a promising
technology for extracting useful knowledge from a database; however, its
direct application to a practical database often ends with an undesirable
result. In this type of data mining, the central task is to find regularities
of data patterns that appear in a database with sufficient frequency. For a
database with overly detailed expressions, essential regularities of data
patterns may hide among diverse kinds of patterns with small frequencies. In
this case we need to ignore minor differences among data patterns and focus
only on the essential information. Data mining of generalized association
rules [SA95, HF95] was developed based on this observation.

In data mining of generalized association rules, information on a data
hierarchy is given in addition to the database. The data hierarchy is intended
to specify generalized or abstract data for each raw datum in the database.
The database can be converted into a generalized database by identifying a set
of data with higher-level data in the hierarchy. Once a generalized database
is constructed, a generalized association rule can be extracted by a method
similar to that used for an ordinary association rule. However, the many
possible generalized databases cause an explosion of the search space, so
improving efficiency becomes a crucial problem. Thus, developing an efficient
data pattern search method for a generalized database is the main theme of
this paper.

In this paper, we investigate data mining of generalized association rules from
the viewpoint of partial-match retrieval, which is an important branch of
database studies. Although both data mining and partial-match retrieval aim to
selectively extract information from a database, this similarity has not been
sufficiently exploited in previous studies on data mining. In IIAb94l blA98l
I11JA98I, data mining procedures are implemented using the standard database
query language SQL. These studies consider architectures that connect a
database system and a data mining system; however, the efficiency issues are
not sufficiently examined from the side of database technologies.

In this paper we regard a data pattern as a partial-match query in
partial-match retrieval. The search for a frequently appearing data pattern
becomes the problem of finding a partial-match query whose answer includes a
sufficient number of records. In this setting, we develop a data pattern
search method that consists of a selective enumeration of candidate queries
and an efficient partial-match retrieval method using signatures as secondary
indices for the database. Because the signatures encode the information in the
database and the data hierarchy as compact bit sequences, the main part of the
method is carried out quickly by simple bit operations. Moreover, the space
overhead caused by the signatures is relatively small in most practical cases.

In Section 2 we formalize partial-match retrieval to deal with a data
hierarchy and consider an efficient retrieval method using signatures. In
Section 3 we consider data mining of generalized association rules in terms of
partial-match retrieval and propose a new data mining method. We show some
properties of the method and discuss performance issues with mathematical
considerations.



2 Generalized Partial-Match Retrieval 



For simplicity, we identify a record in a database with a set of keywords
and suppose that the set of all keywords $K$ is prespecified. Thus, data dealt
with in this paper are restricted to keywords. We further suppose that another
finite set $G$, where $K \cap G = \emptyset$, is also prespecified. $G$ is
intended to define generalized keywords for elements in $K$. We hereafter call
an element of $K \cup G$ a keyword. If necessary, we distinguish a raw keyword
in $K$ from a generalized keyword in $G$.

A keyword hierarchy is a partial order $\preceq$ on $K \cup G$, where no two
distinct elements $k_i, k_j \in K$ exist such that $k_i \preceq k_j$ or
$k_j \preceq k_i$ holds. Keywords $x$ and $y$ are comparable if $x \preceq y$
or $y \preceq x$ holds; otherwise they are incomparable. A keyword hierarchy
thus requires that any pair of raw keywords be incomparable. This is necessary
to avoid mixing different generalization levels in an original record.

For a keyword $x \in K \cup G$, $anc(x) = \{a \in K \cup G \mid x \preceq a\}$
is called the set of ancestors of $x$.

Example 1. Figure 1 shows a keyword hierarchy as a directed acyclic graph.
In this graph, $x \preceq y$ holds if there is an arrow from $y$ to $x$.
K = { Jacket, Ski Pants, Shirt, Shoes, Hiking Boots } is the set of raw
keywords and G = { Outerwear, Clothes, Footwear } is the set of generalized
keywords. Since there is no arrow connecting elements in K, no pair in K is
comparable.



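[Figure: directed acyclic graph with arrows Clothes -> Outerwear,
Clothes -> Shirt, Outerwear -> Jacket, Outerwear -> Ski Pants,
Footwear -> Shoes, Footwear -> Hiking Boots]

Fig. 1. Example of Keyword Hierarchy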



Record   Keyword
R1       Shirt
R2       Jacket, Hiking Boots
R3       Ski Pants, Hiking Boots
R4       Shoes
R5       Shoes
R6       Jacket

Fig. 2. Example of Database



For a keyword hierarchy $H = (K \cup G, \preceq)$ and a record
$R = \{r_1, \ldots, r_m\}$ having $m$ keywords, the generalized record for $R$
with respect to $H$, denoted by $ext_H(R)$, is obtained by replacing each
$r_i$ by the set of ancestors $anc(r_i)$:

  $ext_H(R) = \bigcup_{i=1}^{m} anc(r_i)$

For a keyword hierarchy $H = (K \cup G, \preceq)$ and a database
$D = \{R_1, \ldots, R_N\}$ having $N$ records, the generalized database with
respect to $H$, denoted by $ext_H(D)$, is obtained by replacing each record
$R_i$ by the generalized record $ext_H(R_i)$:

  $ext_H(D) = \{ext_H(R_1), \ldots, ext_H(R_N)\}$



Example 2. The database shown in Figure 3 is the generalized database for the
database shown in Figure 2 with respect to the keyword hierarchy in Figure 1.
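
The construction in Example 2 can be reproduced with a few lines of code. The
sketch below is illustrative only: the hierarchy is encoded as a
child-to-parents map (the edges from Outerwear to Jacket and Ski Pants are
inferred from the generalized records of Figure 3), and the names `PARENTS`,
`anc` and `ext_record` are hypothetical. Since a partial order is reflexive,
`anc(x)` contains `x` itself.

```python
# Child -> parents map for the keyword hierarchy of Figure 1.
PARENTS = {
    "Jacket": {"Outerwear"},
    "Ski Pants": {"Outerwear"},
    "Outerwear": {"Clothes"},
    "Shirt": {"Clothes"},
    "Shoes": {"Footwear"},
    "Hiking Boots": {"Footwear"},
}

# The database of Figure 2.
DATABASE = {
    "R1": {"Shirt"},
    "R2": {"Jacket", "Hiking Boots"},
    "R3": {"Ski Pants", "Hiking Boots"},
    "R4": {"Shoes"},
    "R5": {"Shoes"},
    "R6": {"Jacket"},
}


def anc(x, parents=PARENTS):
    """Return anc(x): x together with every keyword reachable upwards."""
    found, stack = {x}, [x]
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in found:
                found.add(p)
                stack.append(p)
    return found


def ext_record(record, parents=PARENTS):
    """Return ext_H(R), the union of anc(r) over all keywords r in R."""
    out = set()
    for r in record:
        out |= anc(r, parents)
    return out


# ext_H(D): every record replaced by its generalized record.
GENERALIZED = {name: ext_record(rec) for name, rec in DATABASE.items()}
```

With these definitions, `GENERALIZED["R2"]` contains Jacket, Hiking Boots,
Outerwear, Clothes and Footwear, in agreement with Figure 3.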



Record   Keyword
R1       Shirt, Clothes
R2       Jacket, Hiking Boots, Outerwear, Clothes, Footwear
R3       Ski Pants, Hiking Boots, Outerwear, Clothes, Footwear
R4       Shoes, Footwear
R5       Shoes, Footwear
R6       Jacket, Outerwear, Clothes

Fig. 3. Example of Generalized Database



As in the record case, a partial-match query is identified with a set of
keywords that possibly includes generalized keywords. We do not consider
logical connectives of keywords in the following discussion.

For a keyword hierarchy $H$, a partial-match query $Q$ and a database $D$,
generalized partial-match retrieval finds the generalized answer set
$AR(Q, ext_H(D))$ with respect to $H$, which is defined by

  $AR(Q, ext_H(D)) = \{R_i \mid R_i \in D,\ Q \subseteq ext_H(R_i)\}$



2.1 Partial-Match Retrieval Using Signatures

We consider an efficient method to find the generalized answer set using
signatures [Fal85, FBY92].

A keyword-to-signature function $sgn_{key}$ maps a keyword of $K \cup G$ into
a bit sequence of length $b$, where exactly $w$ bits of the sequence are set
to 1. The bit sequence is called the signature for the keyword and $w$ is
called the weight of the signature.

In the following, $\vee$ and $\wedge$ denote the bitwise logical OR and AND of
bit sequences, respectively.

For a record $R$ and a keyword hierarchy $H = (K \cup G, \preceq)$, the
signature for $R$ with respect to $H$ is defined by using the generalized
record $ext_H(R) = \{r_1, \ldots, r_m\}$:

  $sgn(ext_H(R)) = \bigvee_{j=1}^{m} sgn_{key}(r_j)$

The signature for a partial-match query $Q = \{q_1, \ldots, q_l\}$ is
similarly computed by

  $sgn(Q) = \bigvee_{j=1}^{l} sgn_{key}(q_j)$

For a database $D$, a partial-match query $Q$ and a keyword hierarchy
$H = (K \cup G, \preceq)$, partial-match retrieval using signatures computes
the drops with respect to $H$, denoted by $DR(Q, ext_H(D))$:

  $DR(Q, ext_H(D)) = \{R \in D \mid sgn(Q) = sgn(Q) \wedge sgn(ext_H(R))\}$

The following properties hold between the drops and the generalized answer
set [Fal85, FBY92].

Proposition 1. For a database $D$, a keyword hierarchy
$H = (K \cup G, \preceq)$ and a partial-match query $Q$, it holds that

  $AR(Q, ext_H(D)) \subseteq DR(Q, ext_H(D))$



Corollary 1. For a record $R$, a keyword hierarchy $H = (K \cup G, \preceq)$
and a partial-match query $Q$, if a bit in the signature $sgn(ext_H(R))$ is 0
where the same bit in the query signature is set to 1, then the record $R$
cannot satisfy the query $Q$ with respect to $H$.

From Proposition 1 we can be sure that all records in the generalized answer
set are always found using the signatures, without omission. From Corollary 1,
a record that does not satisfy the query can be removed by a search restricted
to the bit positions where the query signature's bits are set to 1.

Using Proposition 1, the drops can be decomposed into two disjoint subsets
$D_Q$ and $\overline{D_Q}$, where $DR(Q, ext_H(D)) = D_Q \cup \overline{D_Q}$.
We call $D_Q$ and $\overline{D_Q}$ the good drops and the false drops,
respectively.

False drops are an inevitable noise caused by the information loss in making
the signatures. In spite of the false drops, the drops play an important role
in performing efficient generalized partial-match retrieval. The answer set
for a query can be obtained by examining only the drops; the entire
generalized database need not be compared with the query directly. The point
in this case is to design the signature so as to minimize the false drops. We
consider an optimal signature design problem in Section 3.4.
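
The relation between the answer set and the drops can be checked on the
example database. The following is a minimal sketch under stated assumptions:
the keyword-to-signature function derives its $w$ bit positions from SHA-256
digests, which is only one possible choice and is not specified in this paper,
and the values b = 64 and w = 3 are arbitrary. All names are hypothetical.

```python
import hashlib

# Generalized database of Figure 3.
GENERALIZED_DB = {
    "R1": {"Shirt", "Clothes"},
    "R2": {"Jacket", "Hiking Boots", "Outerwear", "Clothes", "Footwear"},
    "R3": {"Ski Pants", "Hiking Boots", "Outerwear", "Clothes", "Footwear"},
    "R4": {"Shoes", "Footwear"},
    "R5": {"Shoes", "Footwear"},
    "R6": {"Jacket", "Outerwear", "Clothes"},
}


def sgn_key(keyword, b=64, w=3):
    """Signature of one keyword: an integer with exactly w of b bits set."""
    positions, counter = set(), 0
    while len(positions) < w:  # requires w <= b
        digest = hashlib.sha256(f"{keyword}#{counter}".encode()).digest()
        positions.add(int.from_bytes(digest[:4], "big") % b)
        counter += 1
    sig = 0
    for pos in positions:
        sig |= 1 << pos
    return sig


def sgn(keywords, b=64, w=3):
    """Signature of a record or query: bitwise OR of keyword signatures."""
    sig = 0
    for k in keywords:
        sig |= sgn_key(k, b, w)
    return sig


def drops(query, db, b=64, w=3):
    """DR(Q, ext_H(D)): records whose signature covers the query signature."""
    q = sgn(query, b, w)
    return {name for name, rec in db.items() if q & sgn(rec, b, w) == q}


def answers(query, db):
    """AR(Q, ext_H(D)): records that really contain every query keyword."""
    return {name for name, rec in db.items() if set(query) <= rec}
```

Because every keyword of a matching record contributes its bits to the record
signature, `answers(Q, db)` is always a subset of `drops(Q, db)`; any extra
element of `drops` is a false drop and is eliminated by comparing the record
itself with the query.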

3 Data Mining Based on Partial-Match Retrieval 

3.1 Definitions 

In the previous sections we informally described a generalized association
rule as being extracted from a regularity of data patterns. Since this paper
regards a record as a simple set of keywords, a data pattern is identified
with a set of keywords. Conversely, for any set of keywords we can consider
the corresponding partial-match query. Thus, a generalized association rule is
formally defined in terms of partial-match retrieval as follows.

Definition 1. For a database $D$ of size $N$, a keyword hierarchy
$H = (K \cup G, \preceq)$ and a real number $0 < \sigma < 1$, a generalized
$\sigma$-pattern of size $k$ with respect to $H$ is a partial-match query
$Q \subseteq K \cup G$ that has $k$ distinct, pairwise incomparable keywords
and satisfies

  $\#AR(Q, ext_H(D)) / N > \sigma$

Note that we exclude comparable keywords from a $\sigma$-pattern to avoid
meaningless patterns.

Definition 2. For a database $D$ of size $N$, a keyword hierarchy
$H = (K \cup G, \preceq)$ and real numbers $0 < \sigma, \gamma < 1$, an
expression $A \Rightarrow B$, where $A, B \subseteq K \cup G$ and
$A \cap B = \emptyset$, is called a generalized association rule with respect
to $H$ in $D$ of support $\sigma$ and confidence $\gamma$ if it satisfies

  $\#AR(A \cup B, ext_H(D)) / N > \sigma$, and

  $\#AR(A \cup B, ext_H(D)) / \#AR(A, ext_H(D)) > \gamma$

Data mining of generalized association rules is now defined as the problem of
finding the set of all generalized association rules with respect to $H$ in
$D$ of support $\sigma$ and confidence $\gamma$ for a given database $D$, a
keyword hierarchy $H$ and real numbers $0 < \sigma, \gamma < 1$.
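
The two ratios in Definition 2 can be computed directly from answer-set
sizes. The sketch below uses the generalized database of Figure 3 and exact
answer sets (no signatures); the function names are hypothetical.

```python
# Generalized database of Figure 3, as a list of keyword sets.
GENERALIZED_DB = [
    {"Shirt", "Clothes"},
    {"Jacket", "Hiking Boots", "Outerwear", "Clothes", "Footwear"},
    {"Ski Pants", "Hiking Boots", "Outerwear", "Clothes", "Footwear"},
    {"Shoes", "Footwear"},
    {"Shoes", "Footwear"},
    {"Jacket", "Outerwear", "Clothes"},
]


def answer_count(query, db):
    """#AR(Q, ext_H(D)): number of generalized records that contain Q."""
    return sum(1 for record in db if query <= record)


def rule_measures(a, b, db):
    """Return (support ratio, confidence ratio) of the rule A => B."""
    both = answer_count(a | b, db)
    antecedent = answer_count(a, db)
    support = both / len(db)
    confidence = both / antecedent if antecedent else 0.0
    return support, confidence
```

For the rule {Outerwear} => {Hiking Boots}, two of the six records contain
both keywords and three contain Outerwear, so the ratios are 2/6 and 2/3. The
rule is a generalized association rule of support $\sigma$ and confidence
$\gamma$ whenever these ratios exceed $\sigma$ and $\gamma$.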



3.2 Outline of System

As in previous works [AIS93, AS94, MTV94, SA95], the developed system extracts
generalized association rules in two main stages: searching for
$\sigma$-patterns and constructing rules from the patterns. This paper focuses
only on the first stage, whose data flows are outlined in Figure 4. Before
running the procedure of the first stage, the necessary information must be
stored in the system along the solid lines.

A given keyword hierarchy $H = (K \cup G, \preceq)$ is first stored in the
keyword hierarchy base KH. When a record $R$ arrives, the keyword extractor KE
associates the record with a set of keywords $K_i$. Using the keyword
hierarchy base KH, the keyword generalizer KG computes the generalized keyword
set for $K_i$ and makes the extended record $ext_H(R)$. The signature composer
SC computes the signature $sgn(ext_H(R))$ for the extended record. The
computed signature is stored in the signature file S.

Once all records are registered, the system can start the first stage. We
search for the generalized partial-match queries whose answer sets include
more than $\sigma N$ records for a given real number $0 < \sigma < 1$. A
brute-force search that enumerates all possible queries is infeasible for a
practical database. An efficient candidate query generation method is
discussed in the next subsection; here we simply assume that candidate queries
$Q$ are generated one by one. In this system, we examine a generated query $Q$
using the drops instead of the answer set. That is, the query is qualified if
it satisfies $\#DR(Q, ext_H(D)) > \sigma N$.

The dotted lines in Figure 4 show the data flows for finding the drops. For a
candidate query $Q$, the keyword extractor KE identifies $Q$ with a set of
keywords $K_Q$. Using the keyword set $K_Q$, the signature composer computes
the signature $sgn(Q)$ using exactly the same keyword-to-signature function as
for the record signatures. The signature searcher SS scans the signature file
S to find the drops $DR(Q, ext_H(D))$.



[Figure: system components KE (Keywords Extractor), KG (Keywords Generalizer),
SC (Signature Composer) and SS (Signature Searcher), with data items
ext(R_k), K_Q, sgn(ext(R_k)), sgn(K_Q) and DR(Q, ext(D))]

Fig. 4. Outline of System

3.3 Efficient Candidate Generation

The time complexity of the $\sigma$-pattern search is $O(|C| \cdot T_Q)$,
where $|C|$ is the total number of generated candidates and $T_Q$ is the time
to carry out generalized partial-match retrieval for one query. Thus, the size
of the generated candidate sets directly influences the performance of the
system.

The observations given in [AS94, MTV94] are still useful in this case. The
candidate set $C_{k+1}$ is made from $F_k$ basically according to the
following condition:

  $C_{k+1} = \{X \mid |X| = k+1,\ X \text{ is composed of } k+1 \text{ different members in } F_k\}$

where $F_k$ is the set of all $\sigma$-patterns of size $k$.

In the generalized case, more pruning of redundant elements is possible using
the given keyword hierarchy. The next definition is necessary for this purpose.

Definition 3. For a keyword hierarchy $H = (K \cup G, \preceq)$, we extend the
partial order $\preceq$ to the set of partial-match queries. Partial-match
queries $Q_1$ and $Q_2$ containing no comparable keywords satisfy the relation
$Q_2 \preceq Q_1$ if the following conditions are satisfied:

1. for any $x \in Q_1$, there exists $y \in Q_2$ such that $y \preceq x$, and

2. for any $y \in Q_2$, there exists $x \in Q_1$ such that $y \preceq x$.

According to this order on queries, if a partial-match query fails to have a
sufficient number of drops at a generalized level, we can terminate further
search among lower-level queries. The next proposition justifies this
observation.

Proposition 2. For a database $D$, a keyword hierarchy
$H = (K \cup G, \preceq)$ and partial-match queries $Q_2 \preceq Q_1$, it
holds that

  $AR(Q_2, ext_H(D)) \subseteq AR(Q_1, ext_H(D))$

Corollary 2. For a database $D$, a keyword hierarchy
$H = (K \cup G, \preceq)$ and a real number $0 < \sigma < 1$, a partial-match
query $Q_2$ cannot be a generalized $\sigma$-pattern if there exists a
partial-match query $Q_1$ such that $Q_1$ is not a generalized
$\sigma$-pattern and $Q_2 \preceq Q_1$.
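
The order on queries in Definition 3 can be tested mechanically. The sketch
below again encodes the hierarchy of Figure 1 as a child-to-parents map (an
assumption of this illustration, with the Outerwear edges inferred from
Figure 3); the names `leq` and `query_leq` are hypothetical.

```python
# Child -> parents map for the keyword hierarchy of Figure 1.
PARENTS = {
    "Jacket": {"Outerwear"},
    "Ski Pants": {"Outerwear"},
    "Outerwear": {"Clothes"},
    "Shirt": {"Clothes"},
    "Shoes": {"Footwear"},
    "Hiking Boots": {"Footwear"},
}


def anc(x):
    """anc(x): x itself plus all keywords above it in the hierarchy."""
    found, stack = {x}, [x]
    while stack:
        for p in PARENTS.get(stack.pop(), ()):
            if p not in found:
                found.add(p)
                stack.append(p)
    return found


def leq(y, x):
    """Keyword order: y precedes-or-equals x iff x is an ancestor of y."""
    return x in anc(y)


def query_leq(q2, q1):
    """Query order of Definition 3: True iff q2 precedes-or-equals q1."""
    # Condition 1: every keyword of q1 generalizes some keyword of q2.
    cond1 = all(any(leq(y, x) for y in q2) for x in q1)
    # Condition 2: every keyword of q2 is generalized by some keyword of q1.
    cond2 = all(any(leq(y, x) for x in q1) for y in q2)
    return cond1 and cond2
```

For instance, {Jacket, Hiking Boots} precedes {Outerwear, Footwear}, so by
Corollary 2 the former cannot be a $\sigma$-pattern once the latter has been
rejected.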

We first make a candidate set $C_k$ in the usual manner and then remove
redundant elements from $C_k$ using Corollary 2 every time we find an
unqualified candidate. If $x \in C_k$ is verified to have fewer than
$\sigma N$ records in the drops, i.e., $\#DR(x, ext_H(D)) < \sigma N$, all
candidates $y \in C_k$ such that $y \preceq x$ are removed from $C_k$ without
performing partial-match retrieval. In Figure 5, the innermost if block
realizes this removal.

In Figure 5 we summarize the central part of the $\sigma$-pattern search
procedure. Since the drops inevitably include false drops, the
$\sigma$-patterns found by this algorithm are affected by the false drops. We
use overlined symbols, $\overline{F}_k$ and $\overline{C}_k$, in the algorithm
to emphasize the presence of false drops.

In spite of the false drops, we can assure the correctness of the algorithm by
the next proposition.

Proposition 3. For a given database $D$, a keyword hierarchy
$H = (K \cup G, \preceq)$ and a real number $0 < \sigma < 1$, let $F_k$ be the
set of all generalized $\sigma$-patterns of size $k$ and $\overline{F}_k$ be
the set found in the $k$-th iteration of the repeat loop in the algorithm.
Then $\overline{F}_k \supseteq F_k$ holds for any $k$.



Input: a database $D$ of size $N$, a real number $0 < \sigma < 1$ and
       a keyword hierarchy $H = (K \cup G, \preceq)$

Output: the set of all generalized $\sigma$-patterns of all sizes

begin
    let $\overline{F}_1$ be the set of all generalized $\sigma$-patterns of size 1;
    $k = 1$;
    repeat
    begin
        $\overline{C}_{k+1}$ = generate($\overline{F}_k$);
        foreach partial-match query $Q \in \overline{C}_{k+1}$
        begin
            find $DR(Q, ext_H(D))$;
            if $\#DR(Q, ext_H(D)) < \sigma N$ do
                delete each $x \in \overline{C}_{k+1}$ such that $x \preceq Q$
        end
        $\overline{F}_{k+1} = \overline{C}_{k+1}$;
        $k = k + 1$;
    end
    until $\overline{F}_k = \emptyset$
end



Fig. 5. Finding Generalized $\sigma$-patterns Using Drops
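
The procedure of Figure 5 can be sketched in runnable form as follows. This
is an illustration under explicit assumptions rather than the system's
implementation: candidate counts are taken from exact answer sets (a stand-in
for the drops, i.e., the case with no false drops), candidates are generated
by joining patterns of the previous level, and a query is pruned when its
count is below $\sigma N$, mirroring the test in the figure. All function
names are hypothetical.

```python
from itertools import combinations

PARENTS = {  # child -> parents, hierarchy of Figure 1
    "Jacket": {"Outerwear"}, "Ski Pants": {"Outerwear"},
    "Outerwear": {"Clothes"}, "Shirt": {"Clothes"},
    "Shoes": {"Footwear"}, "Hiking Boots": {"Footwear"},
}


def anc(x):
    found, stack = {x}, [x]
    while stack:
        for p in PARENTS.get(stack.pop(), ()):
            if p not in found:
                found.add(p)
                stack.append(p)
    return found


def leq(y, x):
    return x in anc(y)


def query_leq(q2, q1):
    return (all(any(leq(y, x) for y in q2) for x in q1)
            and all(any(leq(y, x) for x in q1) for y in q2))


def generate(level):
    """Size-(k+1) candidates whose size-k subsets all lie in `level`."""
    cands = set()
    for p, q in combinations(level, 2):
        u = p | q
        if len(u) != len(p) + 1:
            continue
        if any(leq(x, y) or leq(y, x) for x, y in combinations(u, 2)):
            continue  # comparable keywords are excluded from patterns
        if all(frozenset(s) in level for s in combinations(u, len(p))):
            cands.add(u)
    return cands


def find_patterns(db, sigma):
    """Level-wise search; db is a list of generalized records (sets)."""
    threshold = sigma * len(db)
    count = lambda q: sum(1 for rec in db if q <= rec)  # stand-in for #DR
    keywords = set().union(*db)
    level = {frozenset([k]) for k in keywords
             if count(frozenset([k])) >= threshold}
    result = set(level)
    while level:
        pending, kept = generate(level), set()
        while pending:
            q = pending.pop()
            if count(q) < threshold:
                # Prune every remaining or kept candidate below q.
                pending = {x for x in pending if not query_leq(x, q)}
                kept = {x for x in kept if not query_leq(x, q)}
            else:
                kept.add(q)
        level = kept
        result |= level
    return result
```

On the generalized database of Figure 3 with $\sigma = 0.3$, the sketch
returns six patterns of size 1 and four of size 2, for example
{Outerwear, Footwear}, while {Jacket, Hiking Boots} is rejected. Replacing
`count` by a signature-based drop count would give the variant described in
the figure, at the price of possible false drops.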



3.4 Optimal Signature Design

As we saw in the previous subsection, the developed algorithm always finds the
set of all $\sigma$-patterns. To increase the accuracy of the algorithm, we
consider, following [Fal85], an optimal signature design method that minimizes
the false drops. The design parameters are briefly summarized in Table 1.

Let $D_f$ be a given expected number of false drops per query. To achieve
$D_f$ optimally, the signature bit length $b$ and the weight $w$ are
determined by

  [equations for $b$ and $w/b$, the latter involving an exponential term, not legible]

where $N$ is the size of the database.

Table 1. Design parameters

  $b$      bit length of signature
  $w$      weight of signature
  $r_i$    number of keywords per record
  $r_Q$    number of keywords per query



To apply the above optimal design formulae, we need to estimate the values of
$r_i$, $r_Q$ and $D_f$. Since the number of false drops decreases as the
number of keywords $r_Q$ increases, adopting $r_Q = 2$ brings the best results
for $r_Q \ge 2$. Note that we find the $\sigma$-patterns of size 1 without
using the signatures.

The expected number of false drops $D_f$ depends on the situation. When we
search for $\sigma$-patterns with small $\sigma$, partial-match retrieval
should be carried out with high accuracy. Thus, the expected number of false
drops needs to be sufficiently smaller than $\sigma N$, where $N$ is the
database size. Accordingly, we set the value of $D_f$ to $\alpha\sigma N$ for
a prespecified $0 < \alpha < 1$.

For a keyword hierarchy $H = (K \cup G, \preceq)$, we assume each raw keyword
in $K$ has at most $m$ ancestors. Then, the generalized record $ext_H(R)$ has
at most $m \cdot r_i$ keywords for a record $R$ having $r_i$ keywords.

As a result, the bit length and the weight are optimally designed by

  [equations for $b$ and $w/b$ with $D_f = \alpha\sigma N$, not legible]

In Table 2 we summarize the optimal parameter values for some cases. The
last column of this table shows the size of the space overhead in bytes needed
to store the signatures.



4 Conclusions 



We conclude this paper with some considerations. We have outlined a new data
mining system for generalized association rules based on partial-match
retrieval using signatures. Since the signature file is accessed selectively
via efficient bit operations, the performance of the system can be maintained
even for a large database. An important advantage of the signature method is
the balance between retrieval efficiency and storage overhead. The storage
space necessary for our system is at most

  $b \cdot N$

bits, where the parameters are the same as in the previous section.

Table 2. Optimal Parameter Values in gen-MARS
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a 


a 


aaN 


ri 


m 


w 


b 


bN (bytes) 


10'^ 


10”^ 


10"® 


1 


5 


3 


5 


108 


13.5 X 10'® 


10® 


10"^ 


10"® 


1 


5 


5 


5 


180 


22.5 X 10® 


10® 


10“® 


10"® 


10"® 


5 


3 


9 


180 


22.5 X 10® 


10® 


10“® 


10"® 


10"® 


5 


5 


9 


300 


37.5 X 10® 


IF 


icr^ 


10"® 


10 


10 


3 


5 


216 


270.0 X 10® 


10'‘ 


10"^ 


10"® 


10 


10 
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5 
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10"® 


10 
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10 
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10"® 
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10 
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20 
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9 
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10"® 


1 


20 


5 


9 
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10"® 


20 


3 


12 


1007 


125.9 X 10® 


10® 


10“^ 


10"® 


10"® 


20 


5 


12 


1678 


209.8 X 10® 
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1000 


30 


3 


5 
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10"® 
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30 


5 


5 


1079 


134.9 X 10® 
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10“® 


10"® 


10 


30 
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9 


1079 
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10“® 
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10 


30 
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10"® 


1 


30 


3 


10 


1294 


161.8 X 10® 


10® 


10“^ 


10"® 


1 


30 


5 


10 


2157 


289.6 X 10® 



On the other hand, the simple bit expression that is commonly used in
[AIS93, AS94, MTV94, SA95] requires at most $m \cdot |K| \cdot N$ bits. In
most practical databases, which include a large number of different keywords,
our method is superior in terms of space overhead.

Moreover, the proposed system can easily introduce further technologies in 
partial-match retrieval. A prefix pattern search IKRV!I‘/>I is such an example. For 
textual data mining fFHytil IFTTAITT) that attracts increasing recent interests, we 
often need to identify keywords agree with the prefixes of specified length. We 
can achieve this requirement by improving the keyword-to-signature function. 

The idea shown in this paper is being realized in an industrial data mining
system that is currently under implementation. Experimental results will
appear in a future paper.

Abstract. Scalability is a key requirement for any KDD and data mining
algorithm, and one of the biggest research challenges is to develop
methods that make it possible to use large amounts of data. One possible
approach for dealing with huge amounts of data is to take a random sample and
do data mining on it, since for many data mining applications approximate
answers are acceptable. However, as argued by several researchers,
random sampling is difficult to use due to the difficulty of determining
an appropriate sample size. In this paper, we take a sequential sampling
approach to solving this difficulty, and propose an adaptive sampling
algorithm that solves a general problem covering many problems arising
in applications of discovery science. The algorithm obtains examples
sequentially in an on-line fashion, and it determines from the obtained
examples whether it has already seen a large enough number of examples.
Thus, the sample size is not fixed a priori; instead, it adaptively depends
on the situation. Due to this adaptiveness, if we are not in a worst-case
situation, as fortunately happens in many practical applications, then we
can solve the problem with a number of examples much smaller than that
required in the worst case. To illustrate the generality of our approach,
we also describe how different instantiations of it can be applied to scale
up knowledge discovery problems that appear in several areas.



1 Introduction 

Scalability is a key requirement for any knowledge discovery and data mining
algorithm. It has been previously observed that many well-known machine
learning algorithms do not scale well. Therefore, one of the biggest research
challenges is to develop new methods that allow machine learning techniques to
be used with large amounts of data.

When we face the problem of a huge input data set, there are typically two
possible ways to address it. One way is to redesign known algorithms so that,
while almost maintaining their performance, they can be run efficiently on
much larger input data sets. The second possible approach is random sampling.
For most data mining applications, approximate answers are acceptable. Thus,
we could take a random sample of the instance space and do data mining on it.
However, as argued by several researchers (see, for instance, [?]), this
approach is less recommendable due to the difficulty of determining the
appropriate sample size. In this paper, we advocate this second approach of
reducing the size of the data through random sampling. To this end, we propose
a general problem that covers many data mining problems and a general sampling
algorithm for solving it.

A typical task of knowledge discovery and data mining is to find some
“rule” or “law” that explains a huge set of examples well. It is often the
case that the number of possible candidates for such rules is still
manageable. Then the task is simply to select a rule among all candidates that
has a certain “utility” on the dataset. This is the problem we discuss in this
paper, and we call it General Rule Selection. More specifically, we are given
an input data set $X$ of examples, a set $\mathcal{H}$ of rules, and a utility
function $U$ that measures the “usefulness” of each rule on $X$. The problem
is to find a nearly best rule $h$, more precisely, a rule $h$ satisfying
$U(h) \ge (1 - \varepsilon) U(h_\star)$, where $h_\star$ is the best rule and
$\varepsilon$ is a given accuracy parameter. Though simple, this problem
covers several crucial topics in knowledge discovery and data mining, as shown
in Section 3.

We would like to solve the General Rule Selection problem by random sampling.
From a statistical point of view, this problem can be solved by first taking a
random sample $S$ from the domain $X$ and then selecting the
$h \in \mathcal{H}$ with the largest $U(h)$ on $S$. If we choose a large
enough number of examples from $X$ at random, then we can guarantee that the
selected $h$ is nearly best within a certain confidence level. We will refer
to this simple method as the batch sampling approach.

One of the most important issues in random sampling is choosing a proper
sample size, i.e., the number of examples. Any sampling method must take into
account problem parameters, an accuracy parameter, and a confidence parameter
to determine the sample size needed to solve the desired problem.

A widely used method is simply to take a fixed fraction of the data set (say,
70%) for extracting knowledge, leaving the rest, e.g., for validation. A minor
objection to this method is that there is no theoretical justification for the
fraction chosen. But, more importantly, this is not very appropriate for large
amounts of data: most likely, a small portion of the database will suffice to
extract almost the same statistical conclusions. In fact, we will be looking
for sampling methods that examine a number of examples independent of the size
of the database.

Widely used and theoretically sound tools for determining the appropriate
sample size for given accuracy and confidence parameters are the so-called
concentration bounds or large deviation bounds, such as the Chernoff or
Hoeffding bounds. They are commonly used in most theoretical learning research
(see [?] for some examples) as well as in many other branches of computer
science. For some examples of sample sizes calculated with concentration
bounds for data mining problems, see, e.g., [?]. While these bounds usually
allow us to calculate the sample size needed in many situations, the resulting
sample size is usually immense if reasonably good accuracy and confidence are
to be obtained. Moreover, in most situations, to apply these bounds we need to
assume knowledge of certain problem parameters that are unknown in practical
applications.
It is important to notice that, in the batch sampling approach, the sample
size is calculated a priori, and thus it must be big enough to work well in
all the situations we might encounter. In other words, the sample size
provided by the above theoretical bounds for the batch sampling approach is
the worst-case sample size, and thus it is an overestimate for most
situations. This is one of the main reasons why researchers have found that,
in practice, these bounds overestimate the necessary sample size in many
non-worst-case situations; see, e.g., the discussion by Toivonen on sampling
for association rule discovery [?].

To overcome this problem, we propose in this paper to do the sampling in an
on-line sequential fashion instead of in batch. That is, the algorithm obtains
the examples sequentially one by one, and it determines from those examples
whether it has already received enough of them to issue the currently best
rule as nearly the best with high confidence. Thus, we do not fix the sample
size a priori. Instead, the sample size depends adaptively on the situation at
hand. Due to this adaptiveness, if we are not in a worst-case situation, as
fortunately happens in most practical cases, we may be able to use
significantly fewer examples than in the worst case. Following this approach,
we propose a general algorithm, AdaSelect, for solving the General Rule
Selection problem, which provides us with efficient tools for many knowledge
discovery applications. This general algorithm evolves from our preliminary
work on on-line adaptive sampling for more specific problems related to model
selection and association rules [?].

The idea of adaptive sampling is quite natural, and various methods for
implementing it have been proposed in the literature. In statistics, in
particular, these methods have been studied in depth under the name
“sequential test” or “sequential analysis”. However, their main goal has been
to test statistical hypotheses. Thus, even though some of these methods are
applicable to some instances of the General Rule Selection problem, as far as
the authors know, there has been no method that is as reliable and efficient
as AdaSelect for the General Rule Selection problem. More recent work on
adaptive sampling comes from the database community and is due to Lipton et
al. [?]. The problem they address is that of estimating the size of a database
query by using adaptive sampling. While their problem (and algorithms) is very
similar in spirit to ours, they do not deal with selecting among competing
hypotheses that may be arbitrarily close in value, and this makes their
algorithms simpler and not directly applicable to our problem. In the data
mining community, the work of John and Langley is related to ours, although
their technique for stopping the sampling is based on fitting the learning
curve, while our stopping condition is based on statistical bounds on the
accuracy of the estimates being computed. For a more detailed comparison of
our work with other methods in adaptive sampling, see the full version of this
paper.

The paper is organized as follows. In the next section we formally define the
problem we want to solve, propose an algorithm, and state two theorems
concerning its reliability and complexity (due to space restrictions, the
proofs of the theorems are omitted from this extended abstract; we refer the
interested reader to the full version of the paper [?]). In Section 3 we
describe several applications of our algorithm to particular data mining
problems. We conclude in Section 4, highlighting future work and pointing out
several enhancements that can be made to the general algorithm.

2 The Adaptive Sampling Algorithm 

In this section we will formally describe the problem we would like to solve; then 
we present our algorithm and investigate its reliability and complexity. 

We begin by introducing some notation. Let $X = \{x_1, x_2, \ldots, x_k\}$ be
a (large) set of examples and let $\mathcal{H} = \{h_1, \ldots, h_n\}$ be a
(finite, not too large) set of $n$ functions such that
$h_i : X \to \mathbb{R}$. That is, $h \in \mathcal{H}$ can be thought of as a
function that can be evaluated on an example $x$, producing a real value
$y_{h,x} = h(x)$ as a result. Intuitively, each $h \in \mathcal{H}$
corresponds to a “rule” or “law” explaining examples, which we will call below
a rule, and $y_{h,x} = h(x)$ measures the “goodness” of the rule on $x$. (In
the following, we identify $h$ with its corresponding rule, and we usually
call $h$ a rule.) For example, if the task is to predict a particular Boolean
feature of example $x$ in terms of its other features, then we could set
$y_{h,x} = 1$ if the feature is predicted correctly by $h$, and $y_{h,x} = 0$
if it is predicted incorrectly. We also assume that there is some fixed
real-valued and nonnegative utility function $U(h)$, measuring some global
“goodness” of the rule (corresponding to) $h$ on the set $X$. More
specifically, for any $S \subseteq X$, $U(h, S)$ is defined as

$$U(h, S) = F(\mathrm{avg}(y_{h,x} : x \in S)),$$

where $F$ is some function $\mathbb{R} \to \mathbb{R}$ and
$\mathrm{avg}(\cdots)$ denotes taking the arithmetic average, i.e.,
$\mathrm{avg}(a_i : i \in I) = (\sum_{i \in I} a_i)/|I|$. Then $U(h)$ is
simply defined as $U(h, X)$. In Section 3 we will describe several
applications of our framework and how $U$ is instantiated to specific
functions; then its meaning will become clearer. Now we are ready to state our
problem.

General Rule Selection
Given: $X$, $\mathcal{H}$, and $\varepsilon$, $0 < \varepsilon < 1$.

Goal: Find $h \in \mathcal{H}$ such that
$U(h) \ge (1 - \varepsilon) \cdot U(h_\star)$,

where $h_\star \in \mathcal{H}$ is the rule with the maximum utility
$U(h_\star)$.

Remark 1: (Accuracy Parameter $\varepsilon$)

Intuitively, our task is to find some $h \in \mathcal{H}$ whose utility is
reasonably high compared with the maximum $U(h_\star)$, where the accuracy of
$U(h)$ relative to $U(h_\star)$ is specified by the parameter $\varepsilon$.
Certainly, the closer $U(h)$ is to $U(h_\star)$ the better. However, in
certain applications, as discussed in Section 3, accuracy is not essential,
and we may be able to use a fixed $\varepsilon$, such as 1/4.

Remark 2: (Confidence Parameter $\delta$)

We want to achieve the goal above by “random sampling”, i.e., by using
examples randomly selected from $X$. Then there must be some chance of
selecting bad examples that make our algorithm yield an unsatisfactory
$h \in \mathcal{H}$. Thus, we introduce one more parameter $\delta > 0$ for
specifying confidence and require that the probability of such an error be
bounded by $\delta$.

Remark 3: (Condition on $\mathcal{H}$)

We assume in the following that the value of $y_{h,x}$ ($= h(x)$) is in
$[0, d]$ for some constant $d > 0$. (From now on, $d$ will denote this
constant.)

Remark 4: (Condition on $U$)

Our goal does not make sense if $U(h_\star)$ is negative. Thus, we assume that
$U(h_\star)$ is positive. Also, in order for (any sort of) random sampling to
work, it cannot happen that a single example drastically changes the value of
$U$; otherwise, we would be forced to look at all examples of $X$ to even
approximate the value of $U(h)$. Thus, we require that the function $F$ that
defines $U$ be smooth. Formally, $F$ needs to be $c$-Lipschitz for some
constant $c > 0$, as defined below. (From now on, $c$ will denote the
Lipschitz constant of $F$.)

Definition 1. A function $F : \mathbb{R} \to \mathbb{R}$ is $c$-Lipschitz if
for all $x, y$ it holds that $|F(x) - F(y)| \le c \cdot |x - y|$. The
Lipschitz constant of $F$ is the minimum $c > 0$ such that $F$ is
$c$-Lipschitz (if there is any).

Observe that all Lipschitz functions are continuous, and that all
differentiable functions with a bounded derivative are Lipschitz. In fact, if
$F$ is differentiable, then by the Mean Value Theorem, the Lipschitz constant
of $F$ is $\max_x |F'(x)|$. As we will see in Section 3, all natural functions
used in the applications we describe satisfy this condition for some $c$. Also
note that from the conditions above, we have $-cd \le U(h) \le cd$ for any
$h \in \mathcal{H}$.

Remark 5: (Minimization Problem)

In some situations, the primary goal might not be to maximize some utility
function over the data but to minimize some penalty function $P$. That is, we
want to find some $h$ such that $P(h) \le (1 + \varepsilon) P(h_\star)$. We
can solve this variant of the General Rule Selection problem by an algorithm
and analysis very similar to the ones we present here. (End Remarks.)

One trivial way to solve our problem is to evaluate all functions $h$ in
$\mathcal{H}$ over all examples $x$ in $X$, hence computing $U(h)$ for all
$h$, and then find the $h$ that maximizes this value. Obviously, if $X$ is
large, this method might be extremely inefficient. We want to solve this task
much more efficiently by random sampling. That is, we want to look only at a
fairly small, randomly drawn subset $S \subseteq X$, find the $h$ that
maximizes $U(h, S)$, and still be sure with probability $1 - \delta$ that the
$h$ we output satisfies $U(h) \ge (1 - \varepsilon) \cdot U(h_\star)$.
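
The trivial exhaustive method and the goal of the problem can be sketched in
Python as follows. This is only an illustration under our own naming: the
functions `utility`, `best_rule_exhaustive`, and `is_nearly_best` are
hypothetical helpers, rules are assumed to be plain Python callables returning
values in $[0, d]$, and $F$ defaults to the identity.

```python
from statistics import fmean

def utility(h, sample, F=lambda z: z):
    """U(h, S) = F(avg(h(x) : x in S)); F defaults to the identity (an assumption)."""
    return F(fmean(h(x) for x in sample))

def best_rule_exhaustive(X, rules, F=lambda z: z):
    """Evaluate every rule on every example and return (index, utility) of the best."""
    utilities = [utility(h, X, F) for h in rules]
    best = max(range(len(rules)), key=utilities.__getitem__)
    return best, utilities[best]

def is_nearly_best(h, X, rules, eps, F=lambda z: z):
    """Check the goal U(h) >= (1 - eps) * U(h_star) against the whole data set."""
    _, u_star = best_rule_exhaustive(X, rules, F)
    return utility(h, X, F) >= (1 - eps) * u_star
```

The cost of `best_rule_exhaustive` grows with the size of `X` times the number
of rules, which is the inefficiency that sampling is meant to avoid.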

One can easily think of the following simple batch sampling approach. Obtain
a random sample $S$ from $X$ of a priori fixed size $m$ and output the
function from $\mathcal{H}$ that has the highest utility in $S$. There are
several statistical bounds for calculating an appropriate number $m$ of
examples. In this paper, we choose the Hoeffding bound, which has been widely
used in computer science (see, e.g., [?]). One can use any reasonable bound
here, and the choice of which one to use should be determined by considering
reliability and efficiency. The reason we choose the Hoeffding bound is that
basically no assumption is necessary^1 for using this bound to estimate the
error probability and calculate the sample size. Bounds such as the central
limit theorem may be used safely in most situations, although formally they
can only be applied as a limit, not for a finite number of steps. It is easy
to modify our algorithm and analyze it with these alternative bounds.

Roughly, for any sample size and error value $\alpha$, the Hoeffding bound
provides us with an upper bound on the probability that an estimate calculated
from a randomly drawn sample differs from its real value by at least $\alpha$.
Thus, by using this bound, we can determine a sample size $m$ that guarantees
that batch sampling yields a rule satisfying the requirement of the problem
with probability at least $1 - \delta$.
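
As an illustration of how such a sample size can be computed, the sketch below
uses the standard two-sided Hoeffding bound for the average of $m$ independent
values in $[0, d]$, namely $2\exp(-2m\alpha^2/d^2)$, together with a union
bound over the $n$ rules. The function names are hypothetical. The text does
not spell out how the error value $\alpha$ has to be chosen from $\varepsilon$
and $U(h_\star)$, so that relation is left to the caller here.

```python
import math

def hoeffding_failure_bound(m, alpha, d=1.0):
    """Two-sided Hoeffding bound on P(|estimate - true mean| >= alpha)
    for the average of m i.i.d. values lying in [0, d]."""
    return 2.0 * math.exp(-2.0 * m * alpha ** 2 / d ** 2)

def batch_sample_size(alpha, delta, n=1, d=1.0):
    """Smallest m such that n * hoeffding_failure_bound(m, alpha, d) <= delta,
    i.e. all n averages are within alpha of their true values with
    probability at least 1 - delta (union bound over the n rules)."""
    return math.ceil(d ** 2 / (2.0 * alpha ** 2) * math.log(2.0 * n / delta))
```

Note that the result depends on $n$ and $\delta$ only logarithmically but on
$\alpha$ quadratically, and that it does not depend on the size of $X$.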

While batch sampling solves the problem, its efficiency is not satisfactory
because it has to choose the sample size for the worst case. To overcome this
inefficiency, we take a sequential sampling approach. Instead of statically
deciding the sample size, our new algorithm obtains examples sequentially one
by one, and it stops according to some condition based on the number of
examples seen and the values of the functions on the examples seen so far.
That is, the algorithm adapts to the situation at hand, and thus if we are not
in the worst case, the algorithm will be able to realize this and stop
earlier. Figure 1 shows pseudo-code for the algorithm we propose, called
AdaSelect, for solving the General Rule Selection problem.



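Algorithm AdaSelect($X$, $\mathcal{H}$, $\varepsilon$, $\delta$)

    $t \leftarrow 0$; $S_0 \leftarrow \emptyset$;
    repeat
        $t \leftarrow t + 1$;
        $x \leftarrow$ randomly drawn example from $X$;
        $S_t := S_{t-1} \cup \{x\}$;
        $\alpha_t \leftarrow (cd/\beta) \sqrt{\ln(2nt(t+1)/\delta)/(2t)}$;
        % where $\beta < 1$ is some constant close to 1 (see the proof of Theorem 1).
    until $\exists h \in \mathcal{H}$ $[U(h, S_t) \ge \alpha_t \cdot (2/\varepsilon - 1)]$;
    output $h \in \mathcal{H}$ with the largest $U(h, S_t)$;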



Fig. 1. Pseudo-code of our on-line sampling AdaSelect. 
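
The pseudo-code of Fig. 1 might be turned into runnable Python roughly as
follows. This is a sketch under several assumptions of ours: the threshold is
taken to be $\alpha_t = (cd/\beta)\sqrt{\ln(2nt(t+1)/\delta)/(2t)}$, examples
are drawn independently with replacement, $\beta$ is set to 0.99, and a
`max_steps` guard (not part of the algorithm) is added so that the loop cannot
run forever when no rule has positive utility. The name `ada_select` and its
signature are hypothetical.

```python
import math
import random

def ada_select(X, rules, eps, delta, F=lambda z: z, c=1.0, d=1.0,
               beta=0.99, rng=None, max_steps=1_000_000):
    """Sequential sampling sketch of AdaSelect.

    X      -- sequence of examples
    rules  -- list of callables h(x) with values in [0, d]
    F, c   -- function defining U(h, S) = F(avg h(x)) and its Lipschitz constant
    Returns (index of the selected rule, number of examples used).
    """
    rng = rng or random.Random()
    n = len(rules)
    sums = [0.0] * n                      # running sums of h(x) over the sample
    for t in range(1, max_steps + 1):
        x = rng.choice(X)                 # i.i.d. draw with replacement (assumption)
        for i, h in enumerate(rules):
            sums[i] += h(x)
        alpha_t = (c * d / beta) * math.sqrt(
            math.log(2.0 * n * t * (t + 1) / delta) / (2.0 * t))
        utilities = [F(s / t) for s in sums]
        best = max(range(n), key=utilities.__getitem__)
        # Stopping condition: some rule has U(h, S_t) >= alpha_t * (2/eps - 1).
        if utilities[best] >= alpha_t * (2.0 / eps - 1.0):
            return best, t
    raise RuntimeError("no rule met the stopping condition within max_steps")
```

Keeping only running sums per rule means each step costs time proportional to
the number of rules, independent of how many examples have been seen so far.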



Now we provide two theorems discussing the reliability and the complexity of
the algorithm AdaSelect. Due to space restrictions, the proofs of the theorems
are not provided in this version but can be obtained from the full version of
this paper [?].

^1 The only assumption is that we can obtain samples that are drawn
independently from the same distribution, a natural assumption that holds for
all the problems considered here.
The first theorem shows the reliability of Algorithm AdaSelect.

Theorem 1. With probability $1 - \delta$, AdaSelect($X$, $\mathcal{H}$,
$\varepsilon$, $\delta$) outputs a function $h \in \mathcal{H}$ such that
$U(h) \ge (1 - \varepsilon) U(h_\star)$.



Next we estimate the running time of the algorithm.

Theorem 2. With probability $1 - \delta$, AdaSelect($X$, $\mathcal{H}$,
$\varepsilon$, $\delta$) halts within $t_0$ steps (in other words,
AdaSelect($X$, $\mathcal{H}$, $\varepsilon$, $\delta$) needs at most $t_0$
examples), where $t_0$ is the largest integer such that
$\alpha_{t_0} > U(h_\star)(\varepsilon/2)$.



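Let us express $t_0$ in a more convenient form. Recall that $\beta \approx 1$.
Then, since approximately $\alpha_{t_0} = U(h_\star)(\varepsilon/2)$, and if
we approximate $x \approx y \ln x$ by $x \approx y \ln y$, we have

$$t_0 \approx 4 \left( \frac{cd}{\varepsilon U(h_\star)} \right)^2 \ln \left( \frac{2cdn}{\varepsilon U(h_\star) \delta} \right).$$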



Let us discuss the meaning of this formula. Note first that it is independent
of the database size. Both $n$ and $\delta$ appear within a logarithm, so
their influence on the complexity is small. In other words, we can handle a
relatively large number of rules and require a very high confidence without
increasing the required sample size too much. The main terms in the above
formula are $1/\varepsilon$ and $(cd)/U(h_\star)$. (Recall that
$U(h) \le cd$ for any $h \in \mathcal{H}$; hence, both $1/\varepsilon$ and
$(cd)/U(h_\star)$ are at least 1.) Thus, AdaSelect will be advantageous in
situations where (1) a relatively large $\varepsilon$ is sufficient, which is
the case in several applications, and (2) $(cd)/U(h_\star)$ is not that large;
this quantity is not bounded in general, but it may not be so large in lucky
cases, which happen more often than the bad cases. As we will see in Section
3, a clever choice of $U$ might allow us to choose a large $\varepsilon$ so
that the overall number of examples might not be very large.
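
To get a feeling for the magnitudes involved, the closed form can be evaluated
numerically. The sketch below assumes that the approximation reads
$t_0 \approx 4 (cd/(\varepsilon U(h_\star)))^2 \ln(2cdn/(\varepsilon U(h_\star)\delta))$;
the function name `t0_approx` and the example values are ours.

```python
import math

def t0_approx(eps, delta, n, u_star, c=1.0, d=1.0):
    """Approximate bound on the number of examples AdaSelect needs,
    assuming t0 ~ 4 (cd / (eps U*))^2 ln(2 c d n / (eps U* delta))."""
    ratio = c * d / (eps * u_star)
    return 4.0 * ratio ** 2 * math.log(2.0 * c * d * n / (eps * u_star * delta))

# Doubling the number of rules adds only 4 * ratio^2 * ln 2 examples,
# whereas halving eps multiplies the leading factor by four.
base = t0_approx(eps=0.5, delta=0.05, n=10, u_star=0.4)
more_rules = t0_approx(eps=0.5, delta=0.05, n=20, u_star=0.4)
smaller_eps = t0_approx(eps=0.25, delta=0.05, n=10, u_star=0.4)
```

The size of the data set does not appear among the arguments, in line with
the observation that the bound is independent of the database size.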

As we promised in the introduction, we now briefly discuss the relation of
our work to the work on sampling in [?]. The authors distinguish between two
types of sampling methods: static sampling methods, which decide whether a
sample is sufficiently similar to a large database while ignoring the
algorithm that is going to be used on the sample, and dynamic sampling
methods, which take into account the algorithm being used. They perform
several experiments, concluding that the latter method is preferable. Notice
that our sampling method can be cast as a dynamic one, since its performance
depends on the choice of $F$, which in turn is determined by the algorithm in
which the method is being used. They also introduce a criterion for evaluating
sampling methods, the Probably Close Enough criterion, which uses an accuracy
and a confidence parameter in a way similar to the goal we define here.



3 Examples of Applications 

In this section we will describe two domains where an instance of our
algorithm can be used to solve a particular problem. The two domains studied
are model or hypothesis selection and induction of decision trees. Due to
space limitations we cannot describe other possible applications, but our
algorithm can also be applied to problems like association rule mining (which
we have already studied in [?]) or subgroup discovery, where batch sampling
methods based on large deviation bounds have already been proposed to speed up
the computational process (see, for instance, [?] and [?]).

Before describing our applications, let us make some remarks about the kind
of problems where the algorithm can be applied. First, recall that our
algorithm is designed for the General Rule Selection problem described in
Section 2. This problem is relatively simple and, in general, we are not going
to be able to use the algorithm as an overall sampling algorithm for a
particular data mining algorithm, since the algorithms used are usually too
complex. However, as we will see below, some parts of a data mining algorithm
can be cast as instances of the General Rule Selection problem, and then those
parts can be carried out using our sampling algorithm instead of all the data.
The problem of learning decision trees that we discuss below is a good
example. While our algorithm does not provide a sampling method for obtaining
the overall sample size necessary to grow a decision tree, it is applicable to
a certain part of the algorithm, in this case the part where the best split
has to be chosen according to a certain splitting criterion. Furthermore, we
will assume that the utility function $U$ can be calculated in an incremental
manner. That is, once a new example is obtained, this example can be added to
$S$ and $U(h, S)$ can be recalculated efficiently from the previous value of
$U$. While our algorithm might work even if this assumption is not true, it
might be too inefficient to completely recalculate the value of $U$ using the
whole sample at every step unless $U$ is incremental in the sense just
mentioned. All the applications that we describe in the following satisfy this
property.
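
For utilities of the form $U(h, S) = F(\mathrm{avg}(y_{h,x} : x \in S))$, the
incremental property can be obtained by storing a running sum and a count. The
class below is a minimal sketch with a hypothetical name and interface; it is
one way to meet the assumption, not a prescribed implementation.

```python
class IncrementalUtility:
    """Maintains U(h, S) = F(avg of the values seen so far) in O(1) per example."""

    def __init__(self, F=lambda z: z):
        self.F = F          # function applied to the average (identity by default)
        self.total = 0.0    # running sum of y_{h,x}
        self.count = 0      # number of examples seen

    def add(self, y):
        """Add the value y = h(x) of a new example and return the updated utility."""
        self.total += y
        self.count += 1
        return self.value()

    def value(self):
        if self.count == 0:
            raise ValueError("utility is undefined on an empty sample")
        return self.F(self.total / self.count)
```

One such object per rule is enough to evaluate the stopping condition after
every new example without revisiting earlier ones.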

In order to instantiate our framework for a particular problem, we need to
specify the meaning of the function class $\mathcal{H}$, what the utility
function $U$ is, and how the conditions on $U$ are satisfied. We will do so
for the following problems.

Model or hypothesis selection.- This is the most typical application of our
framework. The class of functions $\mathcal{H}$ can be seen as a fixed set of
hypotheses or models. This set could be obtained, for instance, from different
runs of rival learning algorithms, or from the same learning algorithm with
different input parameters or architectures; it could contain several
different memory-based hypotheses, or it could simply be fixed to a certain
restricted model space as a consequence of a design decision. Notice that this
latter case is the case of algorithms that try to select a hypothesis from a
simple and small class of hypotheses (for instance, decision stumps) and then
amplify its precision using voting methods like boosting [?]. Thus, in the
rest of the discussion, we will assume that the class $\mathcal{H}$ is fixed,
finite, and of tractable size. The utility function should capture the
criterion of “goodness” of the hypothesis, the typical one being the
prediction error. In order to keep the discussion simple, we will assume that
the hypotheses are binary and that the “goodness” criterion is the prediction
error. For this case, one might naively take $F$ to be just the identity
function, that is,
$U(h, S) = F(\mathrm{avg}(y_{h,x} : x \in S)) = \mathrm{avg}(y_{h,x} : x \in S)$.

Notice, however, that the worst possible prediction error is 1/2, which is
the same as answering by flipping a random coin. Hence, more precisely
speaking, for the “goodness” of a hypothesis $h$, we should measure its
advantage over random guessing. Thus, we want our algorithm to output a
hypothesis whose advantage over random guessing is $\varepsilon$-close to the
best possible advantage in the class. In fact, this setting fits particularly
well with voting methods like AdaBoost [?], where typically one just needs to
obtain a hypothesis that is better than random guessing at every step. For
this purpose, we should set
$U(h, S) = F(\mathrm{avg}(y_{h,x} : x \in S)) = \mathrm{avg}(y_{h,x} : x \in S) - 1/2$.
In particular, for selecting a hypothesis for voting methods, the choice of
$\varepsilon$ may not be so important; we can set it to any constant smaller
than 1. This is the situation that was studied in our preliminary paper [?].
Our new and more general framework allows one to use more complicated utility
functions that could incorporate size or smoothness considerations together
with prediction error; for the case of real-valued functions, we could also
consider, for instance, the mean square error.
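
The advantage-based utility can be written down directly. In the sketch below,
examples are assumed to be `(features, label)` pairs and hypotheses are
assumed to be callables on the features; the helper names are hypothetical.
With $y_{h,x} \in \{0, 1\}$ we have $d = 1$, and $F(z) = z - 1/2$ is
1-Lipschitz, so $c = 1$.

```python
def correctness(h):
    """Turn a classifier h into the rule y_{h,x}: 1 if h predicts the label, else 0."""
    return lambda example: 1.0 if h(example[0]) == example[1] else 0.0

def advantage(avg_correct):
    """F(z) = z - 1/2: advantage over random guessing."""
    return avg_correct - 0.5

def advantage_utility(h, sample):
    """U(h, S) = avg(y_{h,x} : x in S) - 1/2 for a sample of (features, label) pairs."""
    rule = correctness(h)
    return advantage(sum(rule(e) for e in sample) / len(sample))
```

A hypothesis that is no better than a coin flip gets utility 0, so the
requirement $U(h) \ge (1 - \varepsilon) U(h_\star)$ is stated relative to the
best advantage rather than the best raw accuracy.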

A similar setting was previously studied by Maron and Moore in [?], where
they proposed an algorithm called Hoeffding races to accelerate model
selection. Their idea was to discard hypotheses when they are clearly not
going to be among the best ones. Their criterion for discarding them was based
on inverting the Hoeffding bound; we refer the reader to their paper for
details. We can clearly add this feature to our algorithm without compromising
its reliability and complexity, possibly accelerating the total running time.
Notice that combining Hoeffding races with our algorithm does not reduce the
number of examples needed by much, since it depends logarithmically on the
number of models $n$, while it might greatly reduce the computation time,
which depends linearly on $n$.
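
The discarding idea can be illustrated with confidence intervals obtained by
inverting the Hoeffding bound. The sketch below is our own simplified
rendering, not Maron and Moore's exact procedure: it works with “goodness”
averages in $[0, d]$ (higher is better), uses a per-test confidence
`delta_test` that is an assumption of ours, and drops a model once its upper
confidence limit falls below the best lower confidence limit.

```python
import math

def surviving_models(sums, t, delta_test, d=1.0):
    """Return the indices of models that cannot yet be ruled out.

    sums       -- running sums of y_{h,x} for each model after t examples
    delta_test -- confidence parameter used for each interval (assumption)
    """
    # Inverting 2 exp(-2 t r^2 / d^2) = delta_test gives the interval radius r.
    radius = d * math.sqrt(math.log(2.0 / delta_test) / (2.0 * t))
    means = [s / t for s in sums]
    best_lower = max(means) - radius
    return [i for i, m in enumerate(means) if m + radius >= best_lower]
```

Inside a sequential sampling loop, only the surviving indices would need to be
evaluated on later examples, which is where the saving in computation time
comes from.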

In follow-up research, Moore and Lee [?] developed a more efficient version
of Hoeffding races based on a Bayesian approach, under the assumption that the
models' accuracies over the dataset are normally distributed. Furthermore,
they also introduced a modification for discarding models that are almost
indistinguishable from others, thus allowing the race to output a model that
is not the best in the class. Notice that this is also what we are doing
through our accuracy parameter $\varepsilon$.

Decision tree induction.- Algorithms for decision tree induction typically
work by choosing a test for the root from a certain class of node functions
(and subsequently, for the root of each subtree) by exploring the training
data and choosing the one that is the best according to a certain splitting
criterion, such as the entropy measure or the Gini index. In a large dataset
it could be possible to reduce the training time by choosing the split based
only on a subsample of the whole data. Musick, Catlett, and Russell [?]
described an algorithm that implements this idea and chooses the sample based
on how difficult the decision at each node is. Typically, their algorithm uses
a small sample at the root node and enlarges it progressively as the tree
grows. Here, we propose an alternative way to select the appropriate sample
size needed at every node by using an instantiation of our algorithm, which we
describe in the following. We will follow the notation from [?].

For simplicity, we assume that we want to construct a decision tree that
approximates an unknown function $f : X \to \{0, 1\}$ using a training data
set $S \subseteq X$. Furthermore, we assume that the class of node functions
$F$ is fixed a priori and is finite and small, for instance just the input
variables and their negations^2. In fact, this is the case for standard
software packages such as C4.5 or CART. Finally, we denote by
$G : [0, 1] \to [0, 1]$ the splitting criterion used by the top-down decision
tree induction algorithm. A typical example of this function is the binary
entropy $H(q) = -q \log(q) - (1 - q) \log(1 - q)$. Let $T$ be any decision
tree whose internal nodes are labeled by functions in $F$ and whose leaves are
labeled by values in $\{0, 1\}$, and let $l$ be the leaf of $T$ where we want
to make a new internal node, substituting leaf $l$ by a function $h$ from $F$.
Our goal is to choose a function $h$ such that the value of $G$ in the
original tree $T$ (that is, the sum of the values of $G$ in every leaf
weighted by the probability of reaching that leaf) decreases by substituting
$l$ by $h$ in $T$ and labeling the new leaves $l_0$ and $l_1$ according to the
majority class of the instances reaching $l_0$ and $l_1$, respectively. More
formally, let $X_l \subseteq X$ be the set of instances that reach $l$. For
any $x$ in $X_l$ and any $h$ in $F$, we denote by $q$ the probability that
$f(x)$ is 1, by $\tau$ the probability that $h(x)$ is 1, by $r$ the
probability that both $h(x)$ and $f(x)$ are 1, and by $p$ the probability that
$h(x)$ is 0 but $f(x)$ is 1. Notice that these probabilities are taken with
respect to the distribution induced on $X_l$ by the initial distribution on
the training set $S$, which is typically assumed to be the uniform
distribution. Thus, given $S$, $T$, and a particular leaf $l$ of $T$, our goal
is to find the function $h$ that has the maximum value $\Delta(T, l, h)$,
where $\Delta(T, l, h) = G(q) - [(1 - \tau) G(p) + \tau G(r)]$; we denote
this function by $h_\star$.
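
The quantity to be maximized can be computed from the labeled instances
reaching the leaf, as in the following sketch. Several choices here are
assumptions of ours: the gain is taken to be
$G(q) - [(1 - \tau) G(p) + \tau G(r)]$, the Gini index is taken as
$G(q) = 4q(1 - q)$, and $r$ and $p$ are computed as the fractions of positive
labels among the instances with $h(x) = 1$ and $h(x) = 0$ respectively, as is
usual in top-down tree induction. The function names are hypothetical.

```python
def gini(q):
    """Gini index scaled to [0, 1]: G(q) = 4 q (1 - q)."""
    return 4.0 * q * (1.0 - q)

def split_gain(sample, h, G=gini):
    """Gain of replacing a leaf by node function h.

    sample -- list of (x, label) pairs reaching the leaf, labels in {0, 1}
    h      -- node function with values in {0, 1}
    """
    n = len(sample)
    ones = [y for x, y in sample if h(x) == 1]    # labels sent to leaf l_1
    zeros = [y for x, y in sample if h(x) == 0]   # labels sent to leaf l_0
    q = sum(y for _, y in sample) / n             # fraction of positive labels
    tau = len(ones) / n                           # fraction with h(x) = 1
    r = sum(ones) / len(ones) if ones else 0.0    # positives among h(x) = 1
    p = sum(zeros) / len(zeros) if zeros else 0.0 # positives among h(x) = 0
    return G(q) - ((1.0 - tau) * G(p) + tau * G(r))
```

A node function that separates the two classes perfectly attains the full
value $G(q)$, while one that is independent of the labels attains gain 0.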

As we mentioned, if we have a large dataset, a more efficient way to attack
this problem is the following. For a given tree $T$ and leaf $l$, take a
sample from the set $X_l$ of instances reaching $l$ and output the function
$h$ that has the highest value $\Delta(T, l, h)$ in the sample. If the sample
size is chosen appropriately, the value of $\Delta(T, l, h)$, where $h$ is the
function output by this algorithm, is close to $\Delta(T, l, h_\star)$. We can
apply our algorithm to use the appropriate amount of data as follows. We use
$F$ as $\mathcal{H}$ and $U(h, X)$ as $\Delta(T, l, h)$^3. It only remains to
determine the constant $c$ for the Lipschitz condition on $U$. Notice that $U$
is just the sum of three $G$ functions over different inputs. Thus, if $G$ is
Lipschitz for a certain constant $c$, so is $U$. Here we just state how to
obtain the Lipschitz constant for $G$ and leave to the reader the calculation
of the appropriate constant for the whole function $U$. If $G(q)$ is the Gini
index, then its derivative is $G'(q) = 4(1 - 2q)$ and for any $0 < q < 1$,
$|G'(q)| \le 4$. Thus, by the Mean Value Theorem, the Lipschitz constant is 4.
If $G(q)$ is the binary entropy $H(q)$ or the improved splitting criterion
$G(q) = 2\sqrt{q(1 - q)}$ presented in [?], its derivative is not bounded in
the $[0, 1]$ range, and therefore there cannot be a fixed constant that works
for all the possible values. However, suppose that we ignore the input values
very close to 0 or 1 and consider, for instance, the interval $[0.05, 0.95]$;
then both functions are 5-Lipschitz in that interval. Observe that these
extreme values of the splitting function can be disregarded in the case of
decision tree growing, since any $h$ much worse than that is good enough.
Thus, we can now run our algorithm with inputs $\mathcal{H}$, $U$, $d$ as
discussed above and the desired accuracy $\varepsilon$ and confidence level
$\delta$; then the algorithm outputs, with probability larger than
$1 - \delta$, a node function $h$ such that
$\Delta(T, l, h) \ge (1 - \varepsilon) \Delta(T, l, h_\star)$.

^2 This assumption makes sense if all the attributes are discrete or have been
discretized a priori.

^3 Technically speaking, $U$ should depend only on the average of $h(x)$, but
here it also depends on the average of the labels $y$ from the sample. It is
easy to see that our algorithm also works in this situation.

Notice again that the crucial point is the choice of $U$. Moreover, according
to the results in [?], any $h$ such that
$\Delta(T, l, h) \ge \Delta(T, l, h_\star)/2$ should suffice to reduce the
overall error of the whole tree after adding the new internal node; thus, we
can fix $\varepsilon$ to be just 1/2.
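
The Lipschitz bounds for the splitting criteria can be checked numerically by
bounding the absolute value of the derivative on a grid. The sketch below
assumes that the entropy uses base-2 logarithms and that the improved
criterion is $G(q) = 2\sqrt{q(1 - q)}$; both derivatives are written out by
hand, and all function names are hypothetical.

```python
import math

def gini_derivative(q):
    """d/dq of 4 q (1 - q)."""
    return 4.0 * (1.0 - 2.0 * q)

def entropy_derivative(q):
    """d/dq of -q log2(q) - (1 - q) log2(1 - q), which equals log2((1 - q)/q)."""
    return math.log2((1.0 - q) / q)

def sqrt_criterion_derivative(q):
    """d/dq of 2 sqrt(q (1 - q)), which equals (1 - 2q) / sqrt(q (1 - q))."""
    return (1.0 - 2.0 * q) / math.sqrt(q * (1.0 - q))

def max_abs_on_grid(f, lo, hi, steps=10_000):
    """Largest |f(q)| over an evenly spaced grid on [lo, hi], endpoints included."""
    return max(abs(f(lo + (hi - lo) * i / steps)) for i in range(steps + 1))

gini_bound = max_abs_on_grid(gini_derivative, 0.0, 1.0)
entropy_bound = max_abs_on_grid(entropy_derivative, 0.05, 0.95)
sqrt_bound = max_abs_on_grid(sqrt_criterion_derivative, 0.05, 0.95)
```

On $[0.05, 0.95]$ both `entropy_bound` and `sqrt_bound` stay below 5, in
agreement with the constant quoted in the text, while `gini_bound` is 4 on the
whole unit interval.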

4 Concluding Remarks 

We have presented a new methodology for sampling that, while keeping all the
theoretical guarantees of previous ones, is applicable in a wider setting and,
moreover, is very likely to be useful in practice. The key point is that,
rather than deciding the sample size a priori and obtaining it in batch, our
algorithm performs sampling sequentially and maintains a stopping condition
that depends on the problem at hand. In future work, we should verify the
advantage of our approach experimentally. Although the theorems provided in
this paper suggest that our algorithm might be efficiently applicable in
several domains, there is still plenty of work to do to test whether this
assertion is true or not. On the one hand, our algorithm takes advantage of
not being in a worst-case situation, and therefore it is very likely that it
will outperform the usual batch sampling approach. On the other hand,
asymptotics alone do not guarantee practicality, since a huge constant might
spoil all the advantage of sampling over using all the data, even if the most
efficient sampling algorithm is used. We have tested this second point in our
preliminary work [?] using synthetic data, so that we could test a wide range
of values, and the results are very promising. The number of examples used by
our methods is much smaller than the number of examples used by the batch
approach. Moreover, the number of examples was, in many cases, reasonably
small, which suggests that reducing the data size through sampling with our
algorithm might allow us to improve the overall running time. The experimental
results in [?], although in a different context, are also very encouraging. In
any case, it remains to do some experiments using real-world data to test the
real practicality of our algorithm.

Furthermore, we have pointed out several improvements that one might perform
when applying the algorithm to a particular situation. Due to the generality
of our approach, every application deserves an individual and deeper study
to see the possible enhancements that one can make for a particular setting. We
discussed two settings (model selection and induction of decision trees) where
our algorithm is potentially useful. We also mention that there are many other
settings where it can be applied.

Abstract. This paper presents an algorithm for discovering pairs of an
exception rule and a common sense rule under a prespecified schedule.
An exception rule, which represents a regularity of exceptions to a com- 
mon sense rule, often exhibits interestingness. Discovery of pairs of an 
exception rule and a common sense rule has been successful in various 
domains. In this method, however, both the number of discovered rules 
and time needed for discovery depend on the values of thresholds, and 
an appropriate choice of them requires expertise on the data set and 
on the discovery algorithm. In order to circumvent this problem, we 
propose two scheduling policies for updating values of these thresholds 
based on a novel data structure. The data structure consists of multiple 
balanced search-trees, and efficiently manages discovered patterns with 
multiple indices. One of the policies represents a full specification of up- 
dating by the user, and the other iteratively improves a threshold value 
by deleting the worst pattern with respect to its corresponding index. 
Preliminary results on four real-world data sets are highly promising. 
Our algorithm settled values of thresholds appropriately, and discovered 
interesting exception-rules from all these data sets. 



1 Introduction 

In knowledge discovery, rule discovery represents induction of local constraints 
from a data set, and is one of the most important research topics due to its 
generality and simplicity. However, many researchers and practitioners have re- 
marked that a huge number of rules are discovered, and most of them are either 
too obvious or represent coincidental patterns. Obtaining only interesting rules 
is an important challenge in this research area. 

With this in mind, several researchers have worked on exception-rule discovery.
An exception rule represents a regularity of exceptions to a common
sense rule which holds true, with high probability, for many examples. An
exception rule is often useful since it represents a different fact from a
common sense rule, which is often a basis of people’s daily activities. For
instance, consider a species of mushrooms most of which are poisonous but some
of which are exceptionally edible. The exact description of the exceptions is
highly beneficial since it enables the exclusive possession of the edible
mushrooms.

Exception-rule discovery can be classified as either directed or undirected. A
directed method obtains a set of exception rules each of which deviates from a
user-supplied belief. Note that non-monotonic rule discovery belongs to
this approach. Although this approach is time-efficient, a discovered rule tends
to lack unexpectedness. An undirected method obtains a set of pairs of an
exception rule and a common sense rule. Although this approach is less
time-efficient than a directed method, it often discovers unexpected rules since
it is independent of user-supplied domain knowledge.

PEDRE, which belongs to the undirected approach, obtains a set of rule
pairs based on user-specified thresholds. PEDRE is a promising method due to
the understandability of its discovery outcome. In this method, however, no rule
pairs are discovered or too many rule pairs are discovered when the specified
thresholds are inappropriate to the data set. Therefore, a user should have
expertise on the data set and on the discovery algorithm.

In order to circumvent this problem, we propose novel scheduling mecha- 
nisms for updating values of thresholds. Effectiveness of the proposed method is 
demonstrated with real-world data sets. 



2 Undirected Discovery of Exception Rules

2.1 Problem Description 

Let a data set contain n examples each of which is expressed by m attributes.
An event representing either a single value assignment to a discrete attribute
or a single range assignment to a continuous attribute will be called an atom.
We define a conjunction rule as a production rule whose premise is represented
by a conjunction of atoms and whose conclusion is a single atom. Undirected
discovery of exception rules finds a set of rule pairs each of which consists of
an exception rule associated with a common sense rule. Let x and x' have the
same attribute but different attribute values, and
$Y_\mu \equiv y_1 \wedge y_2 \wedge \cdots \wedge y_\mu$,
$Z_\nu \equiv z_1 \wedge z_2 \wedge \cdots \wedge z_\nu$;
then a rule pair $r(x, x', Y_\mu, Z_\nu)$ is defined as a pair of conjunction
rules as follows.

$$r(x, x', Y_\mu, Z_\nu) \equiv (Y_\mu \rightarrow x,\ Y_\mu \wedge Z_\nu \rightarrow x') \qquad (1)$$

In undirected exception-rule discovery, evaluation criteria of a discovered
pattern can be classified as a single criterion, thresholds specification,
and an integration of these two. The straightforward nature of thresholds
specification makes this approach appealing. Evaluation criteria of a discovered
pattern can also be classified as either considering confidence or ignoring
confidence. Although confidence evaluation is useful when exception rules
are numerous, it eliminates all patterns from the output in a data set with few
exception rules. In this paper, we employ thresholds specification with no
confidence evaluation, i.e.

$$\Pr(Y_\mu) \ge \theta_1^S \qquad (2)$$
$$\Pr(x \mid Y_\mu) \ge \mathrm{MAX}(\theta_1^F, \Pr(x)) \qquad (3)$$
$$\Pr(Y_\mu, Z_\nu) \ge \theta_2^S \qquad (4)$$
$$\Pr(x' \mid Y_\mu, Z_\nu) \ge \mathrm{MAX}(\theta_2^F, \Pr(x')) \qquad (5)$$
$$\Pr(x' \mid Z_\nu) \le \mathrm{MIN}(\theta_2^I, \Pr(x')) \qquad (6)$$

where $\Pr(x)$ represents, in the data set, the ratio of examples each of which
satisfies x. Besides, $\theta_1^S$, $\theta_1^F$, $\theta_2^S$, $\theta_2^F$,
$\theta_2^I$ represent user-specified thresholds for the generality
$\Pr(Y_\mu)$ of the common sense rule, the accuracy $\Pr(x \mid Y_\mu)$ of the
common sense rule, the generality $\Pr(Y_\mu, Z_\nu)$ of the exception rule, the
accuracy $\Pr(x' \mid Y_\mu, Z_\nu)$ of the exception rule, and
$\Pr(x' \mid Z_\nu)$ respectively. Here, (2) and (3) represent necessary
conditions for a common sense rule to be regarded as a regularity. Likewise,
(4) and (5) are conditions for an exception rule. On the other hand, (6)
represents a necessary condition for a reference rule $Z_\nu \rightarrow x'$ to
be regarded as a non-regularity. Without this condition, discovered
exception-rules are obvious.

Footnote: In this paper, we use “scheduling” to mean an appropriate update of
threshold values.
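As an illustration of conditions (2) - (6), the following minimal sketch computes the five indices of a rule pair on a toy data set and checks them against thresholds. The function names, the dictionary keys `S1`, `F1`, `S2`, `F2`, `I2` (standing for $\theta_1^S$, $\theta_1^F$, $\theta_2^S$, $\theta_2^F$, $\theta_2^I$) and the toy data are assumptions made for this example only and are not part of the original method's implementation.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Example = Dict[str, str]
Atom = Callable[[Example], bool]


def count(data: Sequence[Example], atoms: Sequence[Atom]) -> int:
    """Number of examples that satisfy every atom in `atoms`."""
    return sum(1 for e in data if all(a(e) for a in atoms))


def pr(data: Sequence[Example], atoms: Sequence[Atom]) -> float:
    """Ratio of examples that satisfy the conjunction of `atoms`."""
    return count(data, atoms) / len(data)


def cond_pr(data: Sequence[Example], target: Atom, given: Sequence[Atom]) -> float:
    """Pr(target | given); taken as 0.0 when no example satisfies `given`."""
    denom = count(data, given)
    return count(data, [target, *given]) / denom if denom else 0.0


def indices(data, x: Atom, x_prime: Atom, y: List[Atom], z: List[Atom]) -> Tuple[float, ...]:
    """Return (s1, c1, s2, c2, c3) for the rule pair r(x, x', Y, Z)."""
    return (
        pr(data, y),                         # s1 = Pr(Y)
        cond_pr(data, x, y),                 # c1 = Pr(x | Y)
        pr(data, [*y, *z]),                  # s2 = Pr(Y, Z)
        cond_pr(data, x_prime, [*y, *z]),    # c2 = Pr(x' | Y, Z)
        cond_pr(data, x_prime, z),           # c3 = Pr(x' | Z)
    )


def satisfies_conditions(data, x, x_prime, y, z, th: Dict[str, float]) -> bool:
    """Check conditions (2)-(6) for the rule pair r(x, x', Y, Z)."""
    s1, c1, s2, c2, c3 = indices(data, x, x_prime, y, z)
    p_x, p_xp = pr(data, [x]), pr(data, [x_prime])
    return (
        s1 >= th["S1"]                        # (2)
        and c1 >= max(th["F1"], p_x)          # (3)
        and s2 >= th["S2"]                    # (4)
        and c2 >= max(th["F2"], p_xp)         # (5)
        and c3 <= min(th["I2"], p_xp)         # (6)
    )


def rows(a: str, b: str, cls: str, times: int) -> List[Example]:
    """Helper that builds `times` identical toy examples."""
    return [{"a": a, "b": b, "cls": cls}] * times


# Toy data set of 20 examples (hypothetical).
DATA = rows("1", "0", "neg", 8) + rows("1", "1", "pos", 2) \
     + rows("0", "1", "neg", 6) + rows("0", "0", "pos", 4)

X = lambda e: e["cls"] == "neg"        # conclusion x of the common sense rule
X_PRIME = lambda e: e["cls"] == "pos"  # conclusion x' of the exception rule
Y = [lambda e: e["a"] == "1"]          # premise Y of the common sense rule
Z = [lambda e: e["b"] == "1"]          # additional premise Z of the exception rule

LOOSE = {"S1": 0.1, "F1": 0.5, "S2": 0.05, "F2": 0.5, "I2": 0.5}
STRICT = dict(LOOSE, S2=0.2)
```

With the loose thresholds the toy rule pair is accepted, whereas raising the hypothetical `S2` threshold above the generality of the exception rule rejects it, which mirrors the sensitivity to threshold values discussed in the text.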

We show an example of a rule pair discovered from the 1994 common clinical
data set, which is described in the experimental section.

uric nitrate = - → bacterial detection = negative
+ sex = F, antiphlogistic = true → bacterial detection = positive
$\Pr(Y_\mu)$ = 0.120, $\Pr(x \mid Y_\mu)$ = 0.737, $\Pr(Y_\mu, Z_\nu)$ = 0.00100,
$\Pr(x' \mid Y_\mu, Z_\nu)$ = 0.619, $\Pr(x' \mid Z_\nu)$ = 0.335

where a “+” represents the premise of the corresponding common sense rule.
According to this rule pair, 12 % of the patients were “uric nitrate = -”, and
73.7 % of them were negative in the bacterial check test. It also shows that
0.100 % of the patients were female and “antiphlogistic = true” in addition to
“uric nitrate = -”, and 61.9 % among them were positive in the bacterial check
test. This exception rule is unexpected and interesting since only 33.5 % of the
patients who were female and “antiphlogistic = true” were positive in the
bacterial check test.



2.2 Discovery Algorithm 

In an undirected discovery method for exception rules, a discovery task is
viewed as a search problem, in which a node of a search tree represents four
rule pairs $r(x, x', Y_\mu, Z_\nu)$, $r(x', x, Y_\mu, Z_\nu)$,
$r(x, x', Z_\nu, Y_\mu)$, $r(x', x, Z_\nu, Y_\mu)$. Since these four rule pairs
employ common atoms, this fourfold representation improves time efficiency by
sharing calculations of probabilities of these atoms. Each node has four flags
for representing rule pairs in consideration.

In the search tree, a node of depth one represents a single conclusion x or x'.
Likewise, a node of depth two represents a pair of conclusions (x, x'). Let
$\mu = 0$ and $\nu = 0$ represent the states in which the premise of a common
sense rule and the premise of an exception rule, respectively, contain no atoms;
then $\mu = \nu = 0$ holds true in a node of depth two. As the depth increases
by one, an atom is added to the premise of the common sense rule or of the
exception rule. A node of depth three satisfies either $(\mu, \nu) = (0, 1)$ or
$(1, 0)$, and a node of depth $l$ ($\ge 4$) satisfies $\mu + \nu = l - 2$
($\mu, \nu \ge 1$).

A depth-first search method is employed to traverse this tree, and the maximum
search-depth M is given by the user. Note that $M - 2$ is equal to the
maximum value for $\mu + \nu$. Now, we introduce a theorem for efficient search.

Theorem 1. Let $\Pr(Y_{\mu'}) \ge \Pr(Y_\mu)$ and $\Pr(Z_{\nu'}) \ge \Pr(Z_\nu)$.
If a rule pair $r(x, x', Y_{\mu'}, Z_{\nu'})$ satisfies at least one of the
inequalities (7) - (11), then the rule pair $r(x, x', Y_\mu, Z_\nu)$ does not
satisfy all of the conditions (2) - (6).

$$\Pr(Y_{\mu'}) < \theta_1^S \qquad (7)$$
$$\Pr(x, Y_{\mu'}) < \theta_1^S\, \mathrm{MAX}(\theta_1^F, \Pr(x)) \qquad (8)$$
$$\Pr(Y_{\mu'}, Z_{\nu'}) < \theta_2^S \qquad (9)$$
$$\Pr(x', Y_{\mu'}, Z_{\nu'}) < \theta_2^S\, \mathrm{MAX}(\theta_2^F, \Pr(x')) \qquad (10)$$
$$\Pr(Z_{\nu'}) < \theta_2^S\, \mathrm{MAX}(\theta_2^F, \Pr(x')) / \mathrm{MIN}(\theta_2^I, \Pr(x')) \qquad (11)$$

Proof. Assume the rule pair $r(x, x', Y_\mu, Z_\nu)$ satisfies (2) - (6) and one
of the inequalities (7) - (11) holds. First, (2) gives
$\Pr(Y_{\mu'}) \ge \Pr(Y_\mu) \ge \theta_1^S$, which contradicts (7). A
contradiction to (9) can be derived similarly from (4). Second, (3) gives
$\Pr(x \mid Y_\mu) = \Pr(x, Y_\mu)/\Pr(Y_\mu) \ge \mathrm{MAX}(\theta_1^F, \Pr(x))$;
then from (2),
$\Pr(x, Y_{\mu'}) \ge \Pr(x, Y_\mu) \ge \theta_1^S\, \mathrm{MAX}(\theta_1^F, \Pr(x))$,
which contradicts (8). A contradiction to (10) can be derived in a similar way.
Third, (6) gives
$\Pr(x' \mid Z_\nu) = \Pr(x', Z_\nu)/\Pr(Z_\nu) \le \mathrm{MIN}(\theta_2^I, \Pr(x'))$;
then from (4) and (5),
$\Pr(Z_{\nu'}) \ge \Pr(Z_\nu) \ge \Pr(x', Z_\nu)/\mathrm{MIN}(\theta_2^I, \Pr(x'))
\ge \Pr(x', Y_\mu, Z_\nu)/\mathrm{MIN}(\theta_2^I, \Pr(x'))
\ge \theta_2^S\, \mathrm{MAX}(\theta_2^F, \Pr(x'))/\mathrm{MIN}(\theta_2^I, \Pr(x'))$.
However, this contradicts (11).

If a rule pair $r(x, x', Y_{\mu'}, Z_{\nu'})$ satisfies at least one of (7) -
(11), the flag which corresponds to this rule pair is turned off. In expanding a
search node, we define that information about flags is inherited by its child
nodes. Thus, $r(x, x', Y_\mu, Z_\nu)$ is ignored in the descendant nodes without
changing the output. If the four flags are all turned off in a search node, the
node is not expanded.
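The pruning test of Theorem 1 can be sketched as a small function that reports which of the inequalities (7) - (11) hold for the probabilities of a search node. The class `Thresholds`, the function `violated_conditions` and its parameter names are hypothetical and chosen for this illustration; the numeric values in the usage are invented.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Thresholds:
    """User-specified thresholds (field names are assumptions of this sketch)."""
    s1: float  # theta_1^S
    f1: float  # theta_1^F
    s2: float  # theta_2^S
    f2: float  # theta_2^F
    i2: float  # theta_2^I


def violated_conditions(p_y: float, p_x_y: float, p_yz: float, p_xp_yz: float,
                        p_z: float, p_x: float, p_xp: float,
                        th: Thresholds) -> List[int]:
    """Return the numbers of the inequalities (7)-(11) that hold.

    p_y = Pr(Y'), p_x_y = Pr(x, Y'), p_yz = Pr(Y', Z'),
    p_xp_yz = Pr(x', Y', Z'), p_z = Pr(Z'), p_x = Pr(x), p_xp = Pr(x').
    A non-empty result means the flag of this rule pair can be turned off.
    """
    hits: List[int] = []
    if p_y < th.s1:
        hits.append(7)
    if p_x_y < th.s1 * max(th.f1, p_x):
        hits.append(8)
    if p_yz < th.s2:
        hits.append(9)
    bound = th.s2 * max(th.f2, p_xp)
    if p_xp_yz < bound:
        hits.append(10)
    ceiling = min(th.i2, p_xp)
    # (11) divides by MIN(theta_2^I, Pr(x')); it is skipped when that is zero.
    if ceiling > 0 and p_z < bound / ceiling:
        hits.append(11)
    return hits


TH = Thresholds(s1=0.1, f1=0.5, s2=0.05, f2=0.5, i2=0.5)

# A node whose probabilities pass every test: it must still be expanded.
keep = violated_conditions(0.5, 0.4, 0.1, 0.1, 0.4, p_x=0.7, p_xp=0.3, th=TH)

# A node whose exception-rule generality is below theta_2^S: inequality (9) holds.
prune = violated_conditions(0.5, 0.4, 0.04, 0.04, 0.4, p_x=0.7, p_xp=0.3, th=TH)
```

Returning the list of satisfied inequalities rather than a single boolean is a design choice of this sketch; it makes it easy to see which threshold is responsible for cutting a branch.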



2.3 Selection of an Atom for a Continuous Attribute 

In this section, we explain how to select an atom given an attribute x. If x is
nominal, our algorithm considers all single-value assignments to it as atoms. On
the other hand, if x is continuous, the algorithm considers single-range
assignments to it as atoms, and these ranges are obtained by discretizing x.

Discretization of a continuous attribute can be classified as either global (done
before hypothesis construction) or local (done during hypothesis construction).
It can also be classified as either supervised (considers class information) or
unsupervised (ignores class information). In this paper, we employ a global
unsupervised method due to its time efficiency.

The equal frequency method, when the number of intervals is k, divides n
examples so that each bin contains n/k (possibly duplicated) adjacent values.
The equal frequency method belongs to unsupervised methods and is widely
used. We obtain the minimum value and the maximum value of a range by
the global equal-frequency method.

In order to exploit the stopping conditions presented in the previous section,
ranges are selected as follows. Note that there are two possible ranges
which consist of $k - 1$ adjacent intervals. Our algorithm first selects each of
these two as ranges, then selects the ranges of $k - 2$ adjacent intervals. Note
that stopping conditions (7) - (11) are employed in ignoring unnecessary ranges
in the latter selection. This procedure is iterated by decrementing the number of
adjacent intervals until either k = 1 or no ranges are left for consideration.
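The range selection just described can be sketched as follows. The function names are hypothetical, and the predicate `is_pruned` is an assumption standing in for the stopping conditions (7) - (11): it is expected to return True when a range can be ignored, in which case every narrower range contained in it is skipped as well.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def equal_frequency_bins(values: Sequence[float], k: int) -> List[Tuple[float, float]]:
    """Split the sorted values into k bins of about n/k adjacent values each.

    Each bin is returned as (minimum value, maximum value).
    """
    ordered = sorted(values)
    n = len(ordered)
    bounds = [i * n // k for i in range(k + 1)]
    return [
        (ordered[bounds[i]], ordered[bounds[i + 1] - 1])
        for i in range(k)
        if bounds[i] < bounds[i + 1]  # skip empty bins when n < k
    ]


def candidate_ranges(k: int,
                     is_pruned: Callable[[int, int], bool]) -> Iterator[Tuple[int, int]]:
    """Yield ranges (first bin, last bin) from k - 1 adjacent bins down to 1.

    A range contained in an already pruned wider range is never examined.
    """
    pruned: List[Tuple[int, int]] = []
    for width in range(k - 1, 0, -1):
        for first in range(0, k - width + 1):
            last = first + width - 1
            if any(p <= first and last <= q for p, q in pruned):
                continue
            if is_pruned(first, last):
                pruned.append((first, last))
                continue
            yield (first, last)


BINS = equal_frequency_bins(range(1, 13), k=4)

# Without pruning, every range of 3, 2 and 1 adjacent bins is generated.
ALL_RANGES = list(candidate_ranges(4, lambda first, last: False))

# If the range covering bins 1..3 is pruned, its sub-ranges are skipped.
AFTER_PRUNING = list(candidate_ranges(4, lambda first, last: (first, last) == (1, 3)))
```

In this sketch a narrower range is dropped as soon as any wider range containing it has been pruned, which is one possible reading of how the stopping conditions are applied to the later selections.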

3 Threshold Dependencies 

The method presented in the previous section has discovered interesting
knowledge in various domains. For example, in the meningitis data set,
a domain expert claims that knowledge representation of the method is very
appealing, and that the discovered knowledge is very interesting.

In this method, however, specification of values of the thresholds
$\theta_1^S$, $\theta_1^F$, $\theta_2^S$, $\theta_2^F$, $\theta_2^I$
is difficult in some cases. For example, if the data set contains few exception
rules, strict values result in few or no discovered patterns. We can avoid this
situation by settling the values loosely, but if the data set contains many
exception rules, too many patterns would be discovered after a long
computation. Both cases are undesirable from the viewpoint of knowledge
discovery. Especially for a data set with many exception rules, finding
appropriate values for thresholds often requires several reapplications of the
method.

In order to make this problem clear, consider rule discovery methods
each of which has a single index for evaluating a pattern. In these methods, the
single index allows the user to specify the number of discovered patterns since
the patterns have a total order. Therefore, such a method discovers patterns
from any data set, and can efficiently manage these patterns with a data
structure such as a heap.
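For the single-index case, a minimal sketch of managing a bounded number of patterns with a heap is given below. It uses Python's standard `heapq` module; the function name `keep_best` and the scores are invented for this illustration.

```python
import heapq
from typing import Iterable, List, Tuple


def keep_best(patterns: Iterable[Tuple[float, str]], limit: int) -> List[Tuple[float, str]]:
    """Keep the `limit` patterns with the largest score, using a min-heap.

    The root of the heap is always the worst stored pattern, so it is the
    one replaced when a better pattern is discovered.
    """
    heap: List[Tuple[float, str]] = []
    for score, name in patterns:
        if len(heap) < limit:
            heapq.heappush(heap, (score, name))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, name))
    return sorted(heap, reverse=True)


BEST = keep_best([(0.3, "a"), (0.9, "b"), (0.1, "c"), (0.5, "d")], limit=2)
```

Because a single index induces a total order, the user only needs to supply `limit`; no threshold on the index itself is required.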

Problems in our method come from the fact that there are five indices
for evaluating a pattern. Therefore, how to update values of thresholds is not
obvious, and it is difficult to manage discovered patterns efficiently. Note that a
“perfect” discovery is possible by selecting an interesting subset of rule pairs, and
by determining threshold values $\theta_1^S$, $\theta_1^F$, $\theta_2^S$,
$\theta_2^F$, $\theta_2^I$ so that this subset is contained
in the output. However, this is impossible prior to discovering rule pairs.



4 Scheduled Discovery of Rule Pairs 

4.1 Data Structure for Discovered Patterns with Multiple Keys 

Operations required for a data structure in knowledge discovery are addition of
discovered patterns, search for unnecessary stored-patterns and deletion of them.
Typically, an unnecessary stored-pattern corresponds to the minimum element.
Therefore, efficiency is required for addition, search, deletion and
minimum-search in such a data structure.

r1 (0.02,0.67,0.0009,0.55,0.40)
r2 (0.11,0.59,0.0009,0.55,0.32)
r3 (0.06,0.67,0.0006,0.57,0.40)
r4 (0.07,0.59,0.0006,0.57,0.32)
r5 (0.06,0.67,0.0030,0.68,0.40)
r6 (0.11,0.59,0.0030,0.68,0.32)
r7 (0.06,0.67,0.0004,0.60,0.34)

Fig. 1. Illustration based on the proposed data structure, where an AVL tree is
employed as a balanced search-tree, and keys are s1 ($\Pr(Y_\mu)$), s2
($\Pr(Y_\mu, Z_\nu)$), c1 ($\Pr(x \mid Y_\mu)$), c2 ($\Pr(x' \mid Y_\mu, Z_\nu)$),
c3 ($\Pr(x' \mid Z_\nu)$). Numbers in a pair of parentheses represent values of
these indices of the corresponding rule pair. In a tree, *r1, *r2, ..., *r7
represent a pointer to r1, r2, ..., r7 respectively.



In order to cope with the problems in the previous section, we have invented a
novel data structure which efficiently manages discovered patterns with multiple
indices. This data structure consists of balanced search-trees. Many balanced
search-trees are known to add, search, delete and minimum-search their elements
in $O(\log \chi)$ time, where $\chi$ is the number of elements. In order to
realize flexible scheduling, we assign a tree to each index. A node of a tree
represents a pointer to a discovered pattern, and this enables fast
transformation of the tree. We show, in Fig. 1, an example of this data
structure which manages seven rule pairs r1, r2, ..., r7.

Now, we should determine a balanced search-tree and keys. For a tree, a B
tree is appropriate if we store a huge number of rule pairs on external memory,
and interactively post-process them. On the other hand, if the number of rule
pairs is moderate (say, below one million), an AVL tree is sufficient. In this
paper, we assume that time needed for discovery is realistic, and the number
of the discovered rule-pairs is moderate. Therefore, we used an AVL tree as a
balanced search-tree.
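The idea of one ordered index per key, each holding pointers to shared rule pairs, can be sketched as follows. This is not the AVL-tree implementation: as a stand-in, each index is a sorted Python list maintained with the standard `bisect` module, so lookups are logarithmic but insertion and deletion are linear in the number of stored rule pairs. The class name `MultiIndexStore` is hypothetical, and reading the tuples of Fig. 1 in the order (s1, c1, s2, c2, c3) is an assumption based on the magnitudes of the values.

```python
import bisect
from typing import Dict, List, Tuple

KEYS = ("s1", "s2", "c1", "c2", "c3")


class MultiIndexStore:
    """Rule pairs indexed by every key; sorted lists stand in for AVL trees."""

    def __init__(self) -> None:
        self.patterns: Dict[str, Dict[str, float]] = {}
        # One ordered index per key, holding (key value, rule pair id).
        self.index: Dict[str, List[Tuple[float, str]]] = {k: [] for k in KEYS}

    def add(self, pid: str, values: Dict[str, float]) -> None:
        self.patterns[pid] = values
        for k in KEYS:
            bisect.insort(self.index[k], (values[k], pid))

    def delete(self, pid: str) -> None:
        """Remove the rule pair and its entry from every index."""
        values = self.patterns.pop(pid)
        for k in KEYS:
            pos = bisect.bisect_left(self.index[k], (values[k], pid))
            del self.index[k][pos]

    def minimum(self, key: str) -> Tuple[float, str]:
        return self.index[key][0]

    def maximum(self, key: str) -> Tuple[float, str]:
        return self.index[key][-1]

    def __len__(self) -> int:
        return len(self.patterns)


# Values of the seven rule pairs listed with Fig. 1.
# Assumed order inside each tuple: (s1, c1, s2, c2, c3).
FIG1_ORDER = ("s1", "c1", "s2", "c2", "c3")
FIG1 = {
    "r1": (0.02, 0.67, 0.0009, 0.55, 0.40),
    "r2": (0.11, 0.59, 0.0009, 0.55, 0.32),
    "r3": (0.06, 0.67, 0.0006, 0.57, 0.40),
    "r4": (0.07, 0.59, 0.0006, 0.57, 0.32),
    "r5": (0.06, 0.67, 0.0030, 0.68, 0.40),
    "r6": (0.11, 0.59, 0.0030, 0.68, 0.32),
    "r7": (0.06, 0.67, 0.0004, 0.60, 0.34),
}

store = MultiIndexStore()
for pid, values in FIG1.items():
    store.add(pid, dict(zip(FIG1_ORDER, values)))
```

Storing only identifiers in the indices plays the role of the pointers in the figure: deleting a rule pair removes one small entry from each index and the pattern itself exactly once.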

Appropriate selection of keys is an interesting problem, which should be
resolved according to characteristics and usefulness of discovered rule-pairs.
However, we leave this topic for further research, and employ s1
($\Pr(Y_\mu)$), s2 ($\Pr(Y_\mu, Z_\nu)$), c1 ($\Pr(x \mid Y_\mu)$), c2
($\Pr(x' \mid Y_\mu, Z_\nu)$), c3 ($\Pr(x' \mid Z_\nu)$) as indices.

It should be noted that these restrictions for a tree and keys are intended
for a clear presentation of this paper, and our data structure can be easily used
with other trees and keys. Also, we restrict the number of elements in an AVL
tree to $\eta$ for realizing the scheduling policies described in the next section.

4.2 Scheduling Policies

In updating values of thresholds, users either have a concrete plan or have no
such plans. We show two scheduling policies for these situations.

In the first policy, a user specifies a sequence of pairs
$(\theta_1, v_1), (\theta_2, v_2), \ldots$ of a threshold $\theta_i$ and its
value $v_i$, where $\theta_i$ is one of $\theta_1^S$, $\theta_1^F$,
$\theta_2^S$, $\theta_2^F$, $\theta_2^I$. The method
starts with i = 1. Suppose $\eta$ pointers are stored in each AVL tree, and a new
rule pair is being discovered. According to the ith pair $(\theta_i, v_i)$, the
value of this threshold is updated to $\theta_i = v_i$, and i is incremented. At
this time, a stored rule-pair whose index value for $\theta_i$ is smaller than
$v_i$ is unnecessary since it is not output. Our algorithm deletes
nodes each of which has a pointer to such a rule pair, and also deallocates space
for such a rule pair. Note that this can be done in $O(\log \eta)$ time using the
AVL tree for $\theta_i$ if the number of deleted pointers can be regarded as a
constant.
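The first policy can be sketched as below. For brevity the stored rule pairs are kept in a plain dictionary and scanned linearly, so the sketch does not have the logarithmic cost of the AVL-tree version. The names `qualifies` and `add_with_schedule` are hypothetical, and treating c3 as bounded from above (as in condition (6)) while the other indices are bounded from below is an assumption of this example.

```python
from typing import Dict, List, Tuple

# Assumption: c3 is compared with "<=", every other index with ">=".
UPPER_BOUND_INDICES = {"c3"}


def qualifies(values: Dict[str, float], thresholds: Dict[str, float]) -> bool:
    """True if the rule pair meets every current threshold."""
    return all(
        values[k] <= t if k in UPPER_BOUND_INDICES else values[k] >= t
        for k, t in thresholds.items()
    )


def add_with_schedule(stored: Dict[str, Dict[str, float]], pid: str,
                      values: Dict[str, float], thresholds: Dict[str, float],
                      schedule: List[Tuple[str, float]], eta: int) -> None:
    """Policy 1: apply the next user-specified (index, value) pair when full."""
    if len(stored) >= eta and schedule:
        index, value = schedule.pop(0)
        thresholds[index] = value
        # Stored rule pairs that no longer qualify would not be output.
        for old in [p for p, v in stored.items() if not qualifies(v, thresholds)]:
            del stored[old]
    if qualifies(values, thresholds):
        stored[pid] = values


R1 = dict(s1=0.02, s2=0.0009, c1=0.67, c2=0.55, c3=0.40)
R2 = dict(s1=0.11, s2=0.0009, c1=0.59, c2=0.55, c3=0.32)
R3 = dict(s1=0.06, s2=0.0006, c1=0.67, c2=0.57, c3=0.40)

thresholds = dict(s1=0.0, s2=0.0, c1=0.5, c2=0.5, c3=0.5)
schedule = [("s1", 0.05)]   # user plan: raise the s1 threshold to 0.05
stored: Dict[str, Dict[str, float]] = {}
for pid, values in (("r1", R1), ("r2", R2), ("r3", R3)):
    add_with_schedule(stored, pid, values, thresholds, schedule, eta=2)
```

In this run the third rule pair arrives when two are already stored, so the scheduled update is applied and the pair with the smallest s1 is discarded before the new one is added.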

In the second policy, a user specifies a total order on a subset of the indices
s1, s2, c1, c2, c3. This method iteratively deletes a rule pair based on an AVL
tree which is selected according to this order. When the last index in the order
is chosen, the top index will be selected next. Suppose $\eta$ pointers are
stored in each AVL tree, and a new rule pair is being discovered. If the current
index is c3, the maximum node of the AVL tree for c3 is deleted. Otherwise, the
minimum node of the current AVL tree is deleted. We also delete, in the other AVL
trees, nodes each of which has a pointer to this rule pair, and deallocate space
for this rule pair. Note that this can also be done in $O(\log \eta)$ time. Let v
be the value of the current index of the deleted rule-pair, then the value of the
corresponding threshold is updated to $\theta = v$.
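A minimal sketch of the second policy follows. As a simplification, the worst rule pair is found by a linear scan over a dictionary instead of by the minimum or maximum node of an AVL tree. The class name `RoundRobinScheduler`, the threshold keys and the sample values are hypothetical.

```python
from typing import Dict, List


class RoundRobinScheduler:
    """Policy 2: delete the worst rule pair for each index in turn."""

    def __init__(self, order: List[str], eta: int, thresholds: Dict[str, float]) -> None:
        self.order = list(order)        # user-specified order of indices
        self.position = 0               # index to be used for the next deletion
        self.eta = eta                  # maximum number of stored rule pairs
        self.thresholds = dict(thresholds)
        self.stored: Dict[str, Dict[str, float]] = {}

    def qualifies(self, values: Dict[str, float]) -> bool:
        # Assumption: c3 is bounded from above, the other indices from below.
        return all(
            values[k] <= t if k == "c3" else values[k] >= t
            for k, t in self.thresholds.items()
        )

    def add(self, pid: str, values: Dict[str, float]) -> None:
        if len(self.stored) >= self.eta:
            key = self.order[self.position]
            self.position = (self.position + 1) % len(self.order)
            # The worst rule pair has the maximum c3 or the minimum of any other index.
            pick = max if key == "c3" else min
            victim = pick(self.stored, key=lambda p: self.stored[p][key])
            # The threshold takes the value of the deleted rule pair.
            self.thresholds[key] = self.stored[victim][key]
            del self.stored[victim]
        if self.qualifies(values):
            self.stored[pid] = values


scheduler = RoundRobinScheduler(
    order=["s1", "c3"], eta=2,
    thresholds=dict(s1=0.0, s2=0.0, c1=0.5, c2=0.5, c3=0.5),
)
scheduler.add("r1", dict(s1=0.02, s2=0.0009, c1=0.67, c2=0.55, c3=0.40))
scheduler.add("r2", dict(s1=0.11, s2=0.0009, c1=0.59, c2=0.55, c3=0.32))
scheduler.add("r3", dict(s1=0.06, s2=0.0006, c1=0.67, c2=0.57, c3=0.40))
scheduler.add("r5", dict(s1=0.06, s2=0.0030, c1=0.67, c2=0.68, c3=0.40))
```

Here the third insertion removes the rule pair with the smallest s1, and the fourth removes the one with the largest c3, so each threshold is tightened only as far as the data require.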

Both of these policies have strong and weak points. While the first policy
allows full specification of thresholds updating, no rule pairs are discovered if
the specified values of thresholds are too strict, and more than $\eta$ nodes are
stored in an AVL tree if the values are too loose. On the other hand, the second
policy always discovers a rule pair if the initial thresholds are loosely settled,
and it is free from the overflow problem. Note that this policy corresponds to
deleting a rule pair based on its weakest index. Suppose that most indices of a
rule pair are excellent in the context of exception-rule discovery. If at least
one of its indices is ranked under the $(\eta + 1)$th, this rule pair is
nevertheless not output.

Note that these shortcomings come from the nature of considering multiple
keys simultaneously. Experience has taught us that these problems can be easily
avoided by modifying specifications, and the number of applications of our
algorithm to a data set is typically at most three.

5 Experimental Evaluation 

In order to evaluate the effectiveness of the proposed approach, we have
conducted various experiments. Here, we assume that the user has no concrete
plans for updating values of thresholds. Unless stated otherwise, the five
indices s1, s2, c1, c2, c3 are specified in this order with the second scheduling
policy presented in the previous section. In the experiments, we settled the
maximum number of discovered rule-pairs to $\eta = 500$ and the maximum
search-depth to M = 5.

5.1 Data Sets with a Large Number of Rule Pairs

The mushroom data set includes 21 descriptions and the edibility class of
8124 mushrooms. All the attributes are nominal, and each of them has two
to twelve values. While there are no restrictions on attributes of atoms in our
approach, users may also impose some constraints on them. In this experiment,
for example, the edibility is the only attribute allowed in the conclusions. Since
this data set has a smaller number of examples than the other data sets, we
settled the maximum search-depth to M = 6.

A large number of rule pairs can be discovered from the mushroom data set.
However, a user does not know this fact when analyzing this data set for the
first time. We settled initial values of the thresholds loosely to
$\theta_1^S$ = 0.0004, $\theta_2^S$ = 10/8124, $\theta_1^F$ = 0.5,
$\theta_2^F$ = 0.5, $\theta_2^I$ = 0.5, and applied the proposed method. As
a result, 500 rule pairs were discovered, and the final values of the thresholds
were $\theta_1^S$ = 0.154, $\theta_2^S$ = 0.00295, $\theta_1^F$ = 0.914,
$\theta_2^F$ = 1.000, $\theta_2^I$ = 0.123. In the
application, 5.00 * 10^? nodes were searched, and the number of rule pairs stored
in the data structure (including those which were deleted) was 4069.

Below, we show an example of rule pairs discovered in this experiment, where
a “+” represents the premise of the corresponding common sense rule. We also
substitute ns1 (= $n\Pr(Y_\mu)$), ns2 (= $n\Pr(Y_\mu, Z_\nu)$), and ns2c2
(= $n\Pr(x', Y_\mu, Z_\nu)$) for s1, s2 and c2 since they are easier to
understand. Some readers may request accuracies for randomly-chosen examples.
However, we omit such analysis and show examples of discovered rule-pairs since
it is generally agreed that semantic evaluation is of prime interest in
exploratory rule discovery.

sp-color = w, population = v → class = poisonous
+ stalk-shape = e, stalk-root = b → class = edible
ns1 = 1824, c1 = 0.964, ns2 = 64, ns2c2 = 64, c3 = 0.0576



According to this rule pair, 96 % of the 1824 mushrooms whose “sp-color” is 
“w” and whose “population” is “v” are poisonous. However, 64 of these (whose 
“stalk-shape” is “e” and whose “stalk-root” is “b” in addition to above condi- 
tions) are all edible. This exception rule is unexpected and interesting since only 
5.8 % of the mushrooms whose “stalk-shape” is “e” and whose “stalk-root” is 
“b” are edible. 

In undirected exception-rule discovery, computational time is approximately
proportional to the number of searched nodes. The above experiment required
many nodes to be searched, and thus its computational time was relatively long.
However, we can reduce time with a simple strategy: if we notice that many rule
pairs are likely to be discovered, we interrupt the process and restart the
algorithm with stricter initial values of thresholds. We have applied our method
with stricter initial values $\theta_1^S$ = 0.1, $\theta_2^S$ = 100/8124,
$\theta_1^F$ = 0.7, $\theta_2^F$ = 0.9, $\theta_2^I$ = 0.5 by
varying the maximum number of rule pairs $\eta = 100, 200, \ldots, 500$. Table 1
shows the numbers of searched nodes, the numbers of rule pairs stored in the data
structure (including those which were deleted), and the final values of the
thresholds. We see, from the table, that the final values of the thresholds are
valid, and the number of nodes is reduced to 44 - 56 %. Final values of
thresholds represent distributions for indices s1, s2, c1, c2, c3 of rule pairs.
We, however, skip detailed analysis of these distributions due to space
constraints.



Table 1. Resulting statistics with respect to the maximum number $\eta$ of rule
pairs in the experiments with the mushroom data set, where initial values were
settled as $\theta_1^S$ = 0.1, $\theta_2^S$ = 100/8124, $\theta_1^F$ = 0.7,
$\theta_2^F$ = 0.9, $\theta_2^I$ = 0.5

$\eta$   # of nodes   # of rule pairs   $\theta_1^S$   $\theta_2^S$   $\theta_1^F$   $\theta_2^F$   $\theta_2^I$
100      2.28*10^?    734               0.177          ?              0.783          1.000          ?
200      ?            1250              0.211          ?              ?              1.000          ?
300      ?            1662              0.248          ?              ?              ?              ?
400      2.71*10^?    1893              0.241          ?              ?              1.000          ?
500      ?            2245              0.217          ?              0.794          ?              ?


Two points should be kept in mind when using this method. First, in the table,
final values of thresholds such as $\theta_1^S$ do not decrease monotonically
with $\eta$ since these values are dependent on the order of rule pairs stored in
the AVL trees. Next, the rule pair shown above is not discovered in these
experiments since its s2 is too small for the initial value
$\theta_2^S$ = 100/8124. Note that initial values of
thresholds represent necessary conditions which are specified by the user, and
influence the discovery outcome.

In order to investigate the effectiveness of updating values of thresholds,
we have applied our method without deleting a node in the AVL trees with
initial values $\theta_1^S$ = 0.0004, $\theta_2^S$ = 10/8124, $\theta_1^F$ = 0.5,
$\theta_2^F$ = 0.5, $\theta_2^I$ = 0.5. In
this experiment, 8.94 * 10^? nodes were searched, and 1.48 * 10^? rule pairs were
discovered. We see that without updating values of thresholds, computational
time nearly doubles, and the number of discovered rule-pairs is huge.

The 1994 common clinical data set, which was originally supplied by a medical
doctor and was preprocessed by the author, includes 135 attributes about
20919 patients in a hospital. Two attributes are continuous, and the other
attributes are nominal with 2 to 64 values. In this experiment, bacterial
detection is the only attribute allowed in the conclusions.

We settled initial values of the thresholds loosely to $\theta_1^S$ = 0.0004,
$\theta_2^S$ = 10/20919, $\theta_1^F$ = 0.5, $\theta_2^F$ = 0.5,
$\theta_2^I$ = 0.5, and applied the proposed method. As
a result, 500 rule pairs were discovered, and the final values of the thresholds
were $\theta_1^S$ = 0.0606, $\theta_2^S$ = 0.000908, $\theta_1^F$ = 0.671,
$\theta_2^F$ = 0.526, $\theta_2^I$ = 0.339. In the
application, 2.32 * 10^? nodes were searched, and the number of rule pairs stored
in the data structure (including those which were deleted) was 2691. Below, we
show an example of rule pairs discovered in this experiment.

department = internal medicine → bacterial detection = negative
+ fever = N/A, uric WBC = 1+ → bacterial detection = positive
ns1 = 2714, c1 = 0.679, ns2 = 25, ns2c2 = 19, c3 = 0.321

Scheduled Discovery of Exception Rules

According to this rule pair, 68 % of the 2714 patients whose department 
was internal medicine were negative in bacterial check test. However, 19 among 
25 of them (whose “fever” was “N/A” and whose “uric WBC” was “1+” in 
addition to above conditions) were positive. This exception rule is unexpected 
and interesting since only 32 % of the patients whose “fever” was “N/A” and 
whose “uric WBC” was “1+” were positive. 

Compared with the results of the mushroom data set, the final values of s2
and of c2 in this experiment are small. As presented in section 4.2, a user can
specify a subset of indices s1, s2, c1, c2, c3, and only values of the
corresponding thresholds for chosen indices are updated. Here we show the result
of choosing s2 and c2. In this experiment, 2.17 * 10^? nodes were searched, and
4106 rule pairs were stored in the data structure (including those which were
deleted). Final values of the thresholds were $\theta_2^S$ = 0.00277,
$\theta_2^F$ = 0.655. Below, we show an example of
rule pairs discovered in this experiment.

drug for repression = true, unknown-fever disease = false → bacterial
detection = negative
+ ward = 7th floor east → bacterial detection = positive
ns1 = 303, c1 = 0.584, ns2 = 77, ns2c2 = 59, c3 = 0.326



According to this rule pair, 58 % of the 303 patients who took drug for 
repression and whose disease was not unknown fever were negative in bacterial 
detection test. However, 59 among 77 of them (whose ward was in the east of 
the 7th floor in addition to above conditions) were positive. This exception rule 
is unexpected and interesting since only 33 % of the patients whose ward was in 
the east of the 7th floor were positive. This rule pair has larger s2 and c2 than in 
the previous experiment, and this fact demonstrates the effect of limiting indices. 

In order to investigate the effectiveness of updating values of thresholds,
we have applied our method without deleting a node in the AVL trees with
initial values $\theta_1^S$ = 0.0004, $\theta_2^S$ = 10/20919, $\theta_1^F$ = 0.5,
$\theta_2^F$ = 0.5, $\theta_2^I$ = 0.5. In this
experiment, 2.72 * 10^? nodes were searched, and 90288 rule pairs were discovered.
We see that a huge number of rule pairs are discovered in this case.



5.2 Data Sets with a Small Number of Rule Pairs 

The earthquake questionnaire data set includes 34 attributes about 16489
citizens during the 1995 Hyogo-ken Nambu Earthquake, which killed more than
4500 people. All attributes are nominal, and each of them has five to nine
values. In this experiment, damage of building is the only attribute allowed in
the conclusions.

We settled initial values of the thresholds loosely to $\theta_1^S$ = 0.0004,
$\theta_2^S$ = 10/16489, $\theta_1^F$ = 0.5, $\theta_2^F$ = 0.5,
$\theta_2^I$ = 0.5, and applied the proposed method. As
a result, 38 rule pairs were discovered, and 5.99 * 10^? nodes were searched.
Below, we show an example of rule pairs discovered in this experiment.

ground = not firm, shake unstable = a little → damage building = no
+ damage wall = partially broken → damage building = picture leaned
ns1 = 317, c1 = 0.791, ns2 = 13, ns2c2 = 8, c3 = 0.111

According to this rule pair, 79 % of the 317 citizens who were on the ground 
which was not firm and who saw unstable things shaking a little suffered no 
damage to their building. However, 8 among 13 of them (whose wall was partially 
broken in addition to above conditions) had their pictures leaned. This exception 
rule is unexpected and interesting since only 11 % of the citizens whose wall was
partially broken had their pictures leaned. 

The 1994 census data set includes 14 attributes about 48842 U.S. citizens for
predicting whether their annual salaries are greater than 50000 $. Six attributes
are continuous, and the other attributes are nominal with 2 to 42 values. In this
experiment, annual salary is the only attribute allowed in the conclusions.

We settled initial values of the thresholds loosely to $\theta_1^S$ = 0.0004,
$\theta_2^S$ = 10/48842, $\theta_1^F$ = 0.5, $\theta_2^F$ = 0.5,
$\theta_2^I$ = 0.5, and applied the proposed method.
As a result, 8 rule pairs were discovered, and 9.48 * 10^? nodes were searched.
Below, we show an example of rule pairs discovered in this experiment. Due to
controversies over discovering race-related knowledge, we substitute X for a
country name.

28 < age < 46, native country = X → annual salary = < 50K $
+ education = Some college → annual salary = > 50K $
ns1 = 48, c1 = 0.833, ns2 = 13, ns2c2 = 7, c3 = 0.189

According to this rule pair, 83 % of the 48 citizens who were born in country
X and who were 28 to 46 years old had an annual salary under 50K $. However, 7
of the 13 among them whose final education was “Some-college” had an annual
salary over 50K $. This exception rule is unexpected
and interesting since only 19 % of the citizens whose final education was
“Some-college” had an annual salary over 50K $.

6 Conclusions 

In this paper, we proposed a scheduling capability for simultaneous discovery of
an exception rule and a common sense rule based on a novel data structure. The
method updates the values of multiple thresholds appropriately according to the
discovery process. Therefore, an appropriate number of rule pairs is discovered
in reasonable time. Moreover, the novel data structure is useful for efficient
management of discovered patterns with multiple indices.

Our method has been applied to four data sets, which have different char- 
acteristics with respect to the number of discovered rule-pairs. Experimental 
results clearly show that our method is effective for data sets with many rule 
pairs as well as for data sets with few rule pairs. Ongoing work mainly focuses 
on discovering useful knowledge in collaboration with domain experts.

Abstract. For several years, Inductive Logic Programming (ILP) has
been developed in two main directions: on the one hand, the classical symbolic
framework of ILP has been extended to deal with numeric values,
and a few works have emerged stating that an interesting domain for
modeling symbolic and numeric values in ILP is Constraint Logic Programming;
on the other hand, applications of ILP in the context of Data
Mining have been developed, with the benefit that ILP systems are
able to deal with databases composed of several relations.

In this paper, we propose a new framework for learning, expressed in 
terms of Constraint Databases: from the point of view of ILP, it gives 
a uniform way to deal with symbolic/numeric values and it extends the 
classical framework by allowing the representation of infinite sets of posi- 
tive/negative examples; from the point of view of Data Mining, it can be 
applied not only to relational databases, but also to spatial databases. 
A prototype has been implemented and experiments are currently in 
progress. 



1 Introduction 

In Inductive Logic Programming, most discrimination learning tasks can be
stated as follows: given a domain theory defining basic predicates $q_1, \ldots, q_l$,
and a set of target predicates $p_1, \ldots, p_m$ defined by positive and negative examples,
find a logic program, expressed with the predicates $q_1, \ldots, q_l, p_1, \ldots, p_m$,
defining the target predicates, covering their positive examples and rejecting their
negative ones.

Examples are usually expressed by ground atoms $p_i(a_1, \ldots, a_n)$, where
$a_1, \ldots, a_n$ denote constants. The domain theory is either extensionally defined
by a set of ground atoms, or intensionally defined by a logic program. In the
latter case, a model, expressed by a set of ground atoms under the closed world
assumption, is computed, and only this model is used for learning. However,
when the representation language contains function symbols other than constants,
this model may be infinite. To overcome this problem, either a finite, but
incomplete, model is computed, usually by limiting the depth of the derivation
trees, or the underlying representation language is restricted to Datalog
(or Datalog$^{Neg}$) rules, meaning that no function symbols other than constants are used.
Nevertheless, this leads to important drawbacks:

— Function symbols must be translated into predicate symbols expressing the
graph of the function. For instance, the function successor must be represented
by a binary predicate succ(X, Y) (succ(X, Y) is true when Y is the
successor of X).

— A finite set of constants is given; thus, when dealing with numeric values,
one is compelled to provide a bound on these values. If we consider again the
successor function and suppose that the domain is restricted to {0, 1, 2, 3},
we must either express that 3 has no successor, or introduce, as done in the
system Foil, a new value $\omega$ expressing that the successor of 3 is out of
range.



Such requirements can lead to learning logic programs that take advantage
of the closed world nature of the representation and that are incorrect when
applied outside the finite domain. To deal with this problem, which is mostly
caused by numeric values, it has been proposed to model the learning
task in the field of Constraint Logic Programming (CLP). We propose a new
framework which is also based on the notion of constraints, but which takes into
account the database dimension, by means of Constraint Databases.



This new domain has emerged in the field of Databases to benefit from the 
declarative power of constraints and from the introduction of domain theories to 
express relations between variables. The use of constraints is very important for 
modeling concepts the extensions of which are infinite, and therefore Constraint 
Databases are of particular interest for modeling spatial or temporal information. 
In this paper, we present a new model for Inductive Logic Programming that
makes it possible to deal with numeric and symbolic information: examples are
represented by conjunctions of constraints and the domain theory is also expressed
by sets of constraints. This model generalizes the classical one: a target concept
defined by a predicate $p$ with $n$ places can be viewed as a relation $p$ over $n$
variables, namely $X_1, \ldots, X_n$; an example usually expressed by a ground atom
$p(a_1, \ldots, a_n)$ can be viewed as a generalized tuple
$X_1 = a_1 \wedge \ldots \wedge X_n = a_n$; and, when a model of
the domain theory is computed before learning, this model, expressed by a set of
ground atoms, can also be viewed as a set of constraints.



The paper is organized as follows. Section 2 is devoted to the basic notions 
of Constraint Databases. In Section 3, we propose a framework for learning in 
Constraint Databases and the underlying notions of coverage and generality are 
defined. The system we have developed is then presented (Section 4). 



2 Constraint Databases 

2.1 Framework 

From the point of view of CLP, Constraint Databases represent another way of
considering constraints, which takes into account the notion of storing data. As
in CLP, it is built on a language for expressing constraints. Let us denote by $\Pi_c$
(respectively $\Sigma_c$) the set of constrained predicate (respectively function) symbols
used for expressing constraints, and let $\Sigma = (\Sigma_c, \Pi_c)$. We assume that $\Pi_c$ contains
at least the equality operator (=) between variables, or between a variable and
a constant of the domain. Terms and constraints are syntactically defined as
follows:

A term is a variable or a constant; if $f$ is a function symbol of $\Sigma_c$ with arity $n$
and if $t_1, \ldots, t_n$ are terms, then $f(t_1, \ldots, t_n)$ is a term.

A constraint is an expression $p(t_1, \ldots, t_n)$, where $p \in \Pi_c$ and $t_1, \ldots, t_n$ are
terms.

To compute the semantics of constraints, a $\Sigma$-structure $\mathcal{D}$ must be defined. It
is composed of a set $D$ representing the domain of the constraints, and, for each
function symbol of $\Sigma_c$ and for each predicate symbol of $\Pi_c$, an assignment of
functions and relations on the domain $D$ which respect the arities of the symbols.
$\mathcal{D}$ is referred to as the domain of computation.

From the point of view of Databases, Constraint Databases extend
relational, deductive and spatial databases, allowing the definition of infinite
sets by the introduction of generalized k-tuples.

A generalized k-tuple $t$ is a quantifier-free conjunction of constraints over a set
$X_1, \ldots, X_k$ of variables (the variables occurring in the generalized tuple $t$). For
the sake of simplicity, in the following, $X$ will denote the variables $X_1, \ldots, X_k$.

For instance, let us consider $\Pi_c = \{=, <, \le\}$ and $\Sigma_c = \{0, 1, +\}$. Let us write $n$
for the abbreviation of $1 + \ldots + 1$, $n$ times. Let $D$ be the set of integers and $\mathcal{D}$
the $\Sigma$-structure over $D$ which interprets the symbols of $\Sigma$ as usual.
$X < 3 \wedge 5 < Y \wedge Y < 8$ is a generalized 2-tuple over $X$ and $Y$.

A generalized relation $R$ of arity $k$ over the variables $X_1, \ldots, X_k$ is a finite
set of generalized k-tuples over $X_1, \ldots, X_k$. Therefore, it is a disjunction of
conjunctions of constraints.

For instance, a generalized relation $R$ over the variables $X$ and $Y$ can be defined
by $\{ (X < 3 \wedge 5 < Y \wedge Y < 8), (10 < X \wedge Y < 20) \}$. Hence a generalized tuple
of arity $k$ is a finite representation of a possibly infinite set of tuples of arity $k$.

Let us notice that, if we consider a classical relation $R$ defined on the variables
$X$ and $Y$, then, for instance, the tuple $R(3, 4)$ can be represented by the
generalized 2-tuple $X = 3 \wedge Y = 4$, and the whole relation $R$ can be viewed as
a finite set of such generalized 2-tuples.

Finally, a constraint database (or generalized database) is a finite set of general- 
ized relations. 

Let $\mathcal{D}$ be the domain of computation. The extension of a generalized k-tuple $t$
over $X_1, \ldots, X_k$, denoted by $ext(t)$, is the set of tuples $\langle a_1, \ldots, a_k \rangle$ such that
the valuation $X_1 = a_1, \ldots, X_k = a_k$ satisfies $t$.

In the same way, the extension of a relation $R$ is defined by:

$ext(R) = \bigcup_{t \in R} ext(t)$

Let us notice that the extension of a generalized tuple $t$, or of a generalized
relation $R$, may be infinite.
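
To make the notion of extension concrete, the following sketch models a generalized tuple as a Python predicate over a valuation and enumerates the part of $ext(R)$ that falls inside a finite integer window. This is only an illustration under stated assumptions: the two tuples and the helper name `ext_window` are hypothetical, and a real constraint database would not enumerate points.

```python
from itertools import product

# A generalized tuple is modelled as a predicate over a valuation (x, y).
# Both tuples below are illustrative and are not taken from the paper.
t1 = lambda x, y: x <= 3 and 5 <= y <= 8
t2 = lambda x, y: x == 3 and y == 4
R = [t1, t2]


def ext_window(relation, lo, hi):
    """Members of ext(R) whose coordinates lie in the integer window [lo, hi]."""
    points = product(range(lo, hi + 1), repeat=2)
    return {(x, y) for x, y in points if any(t(x, y) for t in relation)}


window = ext_window(R, 0, 10)
print(len(window))  # 17
```

The full extension of `t1` over the integers is infinite, since X is unbounded below; only the window makes it enumerable.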

Codd’s relational algebra has been extended to constraint databases. The classical
algebraic operations, such as join, projection and selection, are defined in
terms of constraints, for a specific domain of computation [2].



2.2 Join 

Let $R_1$ be a generalized relation over variables $X$, and $R_2$ be a generalized
relation over variables $Y$, and let $Z = X \cup Y$; then the natural join of $R_1$ and $R_2$,
denoted by $R_1 \bowtie R_2$, is defined on $Z$ by:

$R_1 \bowtie R_2 = \{ t_1 \wedge t_2 \mid t_1 \in R_1, t_2 \in R_2, ext(t_1 \wedge t_2) \neq \emptyset \}$



Example 1. Let Circles be a relation over the variables $ID_{Circle}$, $X$ and $Y$,
defined as follows: $\{ (ID_{Circle} = 1 \wedge (X - 2)^2 + Y^2 \le 1), (ID_{Circle} = 2 \wedge (X + 3)^2 + Y^2 \le 4) \}$.
In such a relation, a tuple such as $ID_{Circle} = 1 \wedge (X - 2)^2 + Y^2 \le 1$
represents the set of points $(X, Y)$ belonging to the circle with
index 1 and defined by the inequality $(X - 2)^2 + Y^2 \le 1$. Let us consider also
the relation Rectangles defined by $\{ ID_{Rect} = 1 \wedge 0 \le X \le 2 \wedge 0 \le Y \le 25 \}$;
then Intersect = Circles $\bowtie$ Rectangles =
$\{ ID_{Circle} = 1 \wedge (X - 2)^2 + Y^2 \le 1 \wedge ID_{Rect} = 1 \wedge 0 \le X \le 2 \wedge 0 \le Y \le 25 \}$
represents a relation over $ID_{Circle}$, $ID_{Rect}$, $X$ and $Y$,
such that the circle and the rectangle respectively defined by
$ID_{Circle}$ and $ID_{Rect}$ intersect and the points defined by $(X, Y)$ belong to that
intersection. Let us insist on the fact that the tuple
$ID_{Circle} = 2 \wedge (X + 3)^2 + Y^2 \le 4 \wedge ID_{Rect} = 1 \wedge 0 \le X \le 2 \wedge 0 \le Y \le 25$
does not belong to Intersect, since
its extension is empty.
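
The join of Example 1 can be sketched in a few lines, assuming that every tuple of Circles is a closed disc and every tuple of Rectangles an axis-aligned box. The class and function names are hypothetical, and the satisfiability test (clamping the disc centre onto the box) is a stand-in chosen for this special case, not the solver used by the system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Disc:  # ID = id, (X - cx)^2 + (Y - cy)^2 <= r^2
    id: int
    cx: float
    cy: float
    r: float


@dataclass(frozen=True)
class Rect:  # ID = id, x0 <= X <= x1, y0 <= Y <= y1
    id: int
    x0: float
    x1: float
    y0: float
    y1: float


def satisfiable(d, b):
    """True if some point lies in both the disc and the rectangle."""
    # Closest point of the rectangle to the disc centre, obtained by clamping.
    px = min(max(d.cx, b.x0), b.x1)
    py = min(max(d.cy, b.y0), b.y1)
    return (px - d.cx) ** 2 + (py - d.cy) ** 2 <= d.r ** 2


def join(discs, rects):
    """Natural join: keep only pairs whose conjunction has a non-empty extension."""
    return [(d, b) for d in discs for b in rects if satisfiable(d, b)]


circles = [Disc(1, 2, 0, 1), Disc(2, -3, 0, 2)]
rectangles = [Rect(1, 0, 2, 0, 25)]
intersect = join(circles, rectangles)
print([(d.id, b.id) for d, b in intersect])  # [(1, 1)]
```

Only the pair made of circle 1 and rectangle 1 survives, because the conjunction involving circle 2 has an empty extension.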



2.3 Selection 

The selection of a relation $R$ (over variables $X$) on a constraint $c$ over variables
$Y$, with $Y \subseteq X$, denoted by $\sigma_c(R)$, is defined by:

$\sigma_c(R) = \{ t \wedge c \mid t \in R, ext(t \wedge c) \neq \emptyset \}$



Example 2. Let us consider again the relations defined in Example 1.

$\sigma_{(ID_{Circle} = 2)}(Intersect) = \{\}$

$\sigma_{(X \le -2)}(Circles) = \{ ID_{Circle} = 2 \wedge (X + 3)^2 + Y^2 \le 4 \wedge X \le -2 \}$.

The relation $\sigma_{(X \le -2)}(Circles)$ contains a unique tuple. It represents the area of
the second circle for which $X \le -2$. The first circle has been deleted, since no
point in it satisfies $X \le -2$.

2.4 Projection

If $R$ is a generalized relation over variables $X$, and $Z \subseteq X$, then the projection
of $R$ on $Z$, denoted by $\Pi_Z(R)$, is defined by:

$\Pi_Z(R) = \{ \Pi_Z(t) \mid t \in R \}$

where $\Pi_Z(t)$ represents the projection of the tuple $t$ onto the variables $Z$.

Example 3. With Circles defined in Example 1, $\Pi_{[ID_{Circle}, X]}(Circles)$ is
$\{ (ID_{Circle} = 1 \wedge 1 \le X \le 3), (ID_{Circle} = 2 \wedge -5 \le X \le -1) \}$.

Example 4. Let us consider now the relation $R$ defined by
$\{ (X = Z + 1 \wedge Y = Z + 2), (X = 2) \}$; then $\Pi_{[X,Y]}(R) = \{ (X = Y - 1), (X = 2) \}$.
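
Selection and projection on the Circles relation can be sketched in the same spirit. The code below assumes closed discs centred on the X axis, as in Example 1; the dictionary layout and the function names are hypothetical and only serve the illustration.

```python
# Each disc stands for the generalized tuple
#   ID = id, (X - cx)^2 + Y^2 <= r^2   (centre on the X axis, as in Example 1).
circles = [{"id": 1, "cx": 2.0, "r": 1.0}, {"id": 2, "cx": -3.0, "r": 2.0}]


def select_x_le(discs, bound):
    """Selection on X <= bound: keep a tuple only if the added constraint is satisfiable."""
    # The leftmost point of a disc has X = cx - r.
    return [dict(d, x_max=bound) for d in discs if d["cx"] - d["r"] <= bound]


def project_on_x(discs):
    """Projection on [ID, X]: eliminate Y, leaving an interval constraint on X."""
    return [(d["id"], d["cx"] - d["r"], d["cx"] + d["r"]) for d in discs]


print([d["id"] for d in select_x_le(circles, -2)])  # [2]
print(project_on_x(circles))  # [(1, 1.0, 3.0), (2, -5.0, -1.0)]
```

The selection drops the first circle because none of its points satisfies the constraint, and the projection yields the two intervals of Example 3.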

2.5 Union 

The union of two relations is defined by:

$R_1 \cup R_2 = \{ t \mid t \in R_1 \text{ or } t \in R_2 \}$



3 A New Framework for Learning in Constraint 
Databases 

3.1 Coverage - Generality Order 

A generalized k-tuple $t$ over the variables $X_1, \ldots, X_k$ is satisfiable if there exists
an assignment $\alpha$ of elements of $D$ to the variables of $t$ such that $\alpha.t$ is evaluated
to true. In other words, it is satisfiable when its extension is not empty.

We can now compare a generalized tuple over variables $X$ and a generalized
relation over variables $Y$.

A generalized tuple $t$ over variables $X$ is covered by a generalized relation $R$ over
variables $Y$ when $ext(\Pi_{X \cap Y}(t)) \subseteq ext(\Pi_{X \cap Y}(R))$.

For instance, let us consider the generalized relation $R$ composed of the unique
tuple $X \le 1$. The tuple $X = 1 \wedge Y = 1$ is covered by the relation $R$.

A generalized relation $R$ is covered by a generalized relation $R'$ if, $\forall t \in R$, $t$ is
covered by $R'$. In that case, $R'$ is said to be more general than $R$ ($R$ is more
specific than $R'$).



Let us notice that this definition is not equivalent to the following one: for any
tuple $t$ of $R$, there exists a tuple $t'$ of $R'$ such that $t$ is covered by $t'$.

Satisfiability, Consistency checking. When a generalized tuple $t$ is grounded, i.e.,
when $|ext(t)| = 1$, then $t$ is covered by a relation $R$ if there exists $t_R$ in $R$
such that the constraint $t \wedge t_R$ is satisfiable. Therefore, in that case, testing
whether a tuple is covered or uncovered requires a complete satisfiability test (a
satisfiability test is complete if, for every constraint, the test answers yes if the
constraint is satisfiable, and no otherwise).
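
For grounded tuples the coverage test reduces to evaluating each tuple of the relation at a single point. The sketch below illustrates this under two assumptions that are not part of the paper: a grounded tuple is stored as a dictionary of values, and every variable used by a tuple of the relation is present in that dictionary.

```python
# A grounded tuple is a full valuation, e.g. {"X": 1, "Y": 1}.
# A generalized tuple of R is a predicate over such a valuation; it may
# mention only some of the variables (here only X).
def covered(grounded, relation):
    """A grounded t is covered if its conjunction with some t_R in R is satisfiable.

    Because ext(t) is a single point, this amounts to evaluating t_R at that point.
    """
    return any(t_r(grounded) for t_r in relation)


R = [lambda v: v["X"] <= 1]  # a relation with a single tuple constraining X only
print(covered({"X": 1, "Y": 1}, R))  # True
print(covered({"X": 2, "Y": 1}, R))  # False
```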

For some domains, such as linear real equality constraints, finite tree
constraints, or sequence constraints, complete solvers exist. Nevertheless, it may
happen that a complete solver exists but is intractable, and that, for reasons of
efficiency, an incomplete solver is used.

3.2 Learning Setting 

Given:

— a generalized relation $R_t$ over a set of variables, composed of generalized
tuples that can be labelled either as positive tuples or as negative tuples. Let
us insist on the fact that a positive (respectively negative) tuple represents
a possibly infinite set of positive (respectively negative) examples,

— a concept description language $\mathcal{L}$, specifying syntactic restrictions on the
intensional definition of the target relation $R_t$,

— a set BK of generalized relations $R_i$, $i = 1, \ldots, n$, specifying background
knowledge and which can be used in the definition of $R_t$,

Find:

— a generalized relation $R_h$ expressed in terms of (generalized) relational
operations that is complete and consistent, where completeness and consistency
are defined as follows.

Completeness: $R_h$ must cover every positive tuple in $R_t$.
Consistency: $R_h$ must cover no negative tuple in $R_t$.

This framework generalizes the classical ILP setting, since the representation
of tuples by means of constraints makes it possible to represent infinite models.
For instance, the relation Circles defined in Example 1 represents, for each circle, the set
of points belonging to it, and such a set is infinite. In the same way, infinite sets
of positive and negative examples can be expressed, and such examples need not
be ground. Let us also notice that some variables of the relation may not appear
in every tuple. For instance, the two tuples < true, true > and < true, false >
of a binary relation can be viewed as the generalized tuple X = true.

4 Our System 

4.1 The Learning Algorithm 

Our system uses a divide-and-conquer method, like the one used in the system
Foil: it learns a generalized relation covering some positive tuples, it removes
the covered positive tuples, and the process is repeated until no positive
examples remain. Let POS and NEG respectively represent the positive and
negative generalized tuples, and let $\rho(R_c)$ represent a refinement operator, which
takes as input a relation $R_c$ and which gives a set of relations more specific than
$R_c$.

Algorithm.

1. $R_h = \emptyset$
2. POS = positive tuples of $R_t$
3. NEG = negative tuples of $R_t$
4. While POS $\neq \emptyset$
   (a) $R_c = \emptyset$
   (b) While $cov(R_c, NEG) \neq \emptyset$
       i. $R_c$ = best relation in $\rho(R_c)$
   (c) POS = POS $-$ $cov(R_c, POS)$
   (d) $R_h = R_h \cup R_c$

The main steps in that algorithm are the computation of $\rho(R_c)$ and the
choice of the best relation.
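
The covering loop above can be sketched as follows. This is a simplified illustration rather than the system itself, and it rests on several assumptions: examples are grounded dictionaries, the refinement operator only adds one constraint from a fixed list of named candidates, the initial $R_c$ is read as the empty conjunction that covers everything, and the best candidate is picked automatically by the ratio of covered positives plus rejected negatives, whereas the paper leaves the final choice to the user.

```python
def covers(clause, example):
    """A clause is a list of (name, predicate) pairs; the empty clause covers everything."""
    return all(pred(example) for _, pred in clause)


def interest(clause, pos, neg):
    """(covered positives + rejected negatives) / total number of examples."""
    covered_pos = sum(covers(clause, e) for e in pos)
    rejected_neg = sum(not covers(clause, e) for e in neg)
    return (covered_pos + rejected_neg) / (len(pos) + len(neg))


def learn(pos, neg, candidates):
    hypothesis, pos = [], list(pos)
    while pos:                                          # step 4
        clause = []                                     # step (a)
        while any(covers(clause, e) for e in neg):      # step (b)
            unused = [c for c in candidates if c not in clause]
            if not unused:
                break
            best = max(unused, key=lambda c: interest(clause + [c], pos, neg))
            clause = clause + [best]                    # step (b) i.
        newly = [e for e in pos if covers(clause, e)]
        if not newly or any(covers(clause, e) for e in neg):
            break  # no consistent progress: stop instead of looping forever
        pos = [e for e in pos if e not in newly]        # step (c)
        hypothesis.append(clause)                       # step (d)
    return hypothesis


pos = [{"X": 1}, {"X": 2}, {"X": 10}]
neg = [{"X": -1}, {"X": 5}]
candidates = [
    ("X>0", lambda e: e["X"] > 0),
    ("X<4", lambda e: e["X"] < 4),
    ("X>7", lambda e: e["X"] > 7),
]
learned = learn(pos, neg, candidates)
print([[name for name, _ in clause] for clause in learned])
# [['X>0', 'X<4'], ['X>7']]
```

On this toy data the first pass learns a clause covering two positives, and a second pass is needed for the remaining one, which mirrors the repeated covering of step 4.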

Definition (Refinement operator). Let $\mathcal{R}$ be a set of generalized relations
that are supposed to be over sets of variables that are disjoint
($\forall R, S \in \mathcal{R}$, $R \neq S \rightarrow var(R) \cap var(S) = \emptyset$). Let $R \in \mathcal{R}$.

$\rho(R) = \{ R' \mid \exists S \in \mathcal{R}, \exists Z \subseteq var(S), \exists \alpha : Z \rightarrow var(R), R' = R \bowtie \alpha.S \}$

In the following, $\mathcal{R}$ is composed of

— background generalized relations provided by the user before learning,

— or domain specific relations that are induced, for instance, for expressing a
constraint between variables.

In the current implementation, symbolic and numeric domains are ordered and
we search for constraints $X < \theta$ for each variable $X$ occurring in the relation.
Moreover, we search for linear inequalities between numeric variables, as done
in the system ICC.

Let us insist on the fact that Constraint Databases truly provide a uniform way
to deal with these different relations. For example, if we need to introduce a
constraint, for instance $X < 2$, to discriminate between positive and negative
tuples, such a constraint can be viewed as a generalized relation over the variable
$X$ defined by only one generalized tuple, namely $X < 2$. A linear constraint
between variables $X_1, \ldots, X_n$ can be represented by a generalized relation
over variables $X_1, \ldots, X_n$ with a unique generalized tuple, namely that linear
constraint.

Property. Let $R$ be a generalized relation. $\forall R' \in \rho(R)$, $R' \le R$ ($R'$ is more
specific than $R$).

Example 5. Let us illustrate this process by a very simple example. Let us
consider the target relation $R_t$, defined on the variables $X$, $Y$ and $Z$ by:

(+)  $X = 0 \wedge Y = 0 \wedge Z = 0$
(−)  $X = -1 \wedge Y = 0 \wedge Z = 0$
(+)  $X = 1 \wedge Y = 0 \wedge Z = 0$
(−)  $X = 3 \wedge Y = 0 \wedge Z = 0$
(+)  $X = 2 \wedge Y = 0 \wedge Z = 1$

Let us suppose also that we have the background relation $dec(A, B)$, defined
by the generalized tuple $B = A - 1$.

At the beginning of the learning process, the hypothesis relation $R_h$ is empty.
In our simple example, there is only one possible relation to join to $R_h$, namely
$dec(A, B)$. There are many possible bindings, for instance $A = X \wedge B = Y$
or $A = X \wedge B = Z$. Let us suppose that we choose the second binding. Then
the hypothesis $R_h$ is set to $dec(X, Z)$. In that case, since the target relation is
grounded (each generalized tuple has a unique solution), an example $e$ of $R_t$ is
covered by $R_h$ when $e \wedge Z = X - 1$ is satisfiable. Only the last two positive
examples of $R_t$ are covered and there remains a negative example (the first one).
We get the following set of positive and negative examples.

(+)  $X = 1 \wedge Y = 0 \wedge Z = 0$
(−)  $X = -1 \wedge Y = 0 \wedge Z = 0$
(+)  $X = 2 \wedge Y = 0 \wedge Z = 1$

After variable bindings, the system searches for a constraint that would make it
possible to discriminate positive from negative examples. A simple constraint would
be $X > 0$. The hypothesis relation $R_h$ is set to $dec(X, Z) \wedge X > 0$; it is defined by
the only generalized tuple $Z = X - 1 \wedge X > 0$. The two positive examples are
covered (for instance, the first one is covered since
$X = 1 \wedge Y = 0 \wedge Z = 0 \wedge Z = X - 1 \wedge X > 0$ is satisfiable);
the negative example is no longer covered.

The hypothesis that has been learned for our concept is therefore $dec(X, Z) \wedge X > 0$.
It does not cover the first positive example, and therefore, if we want the
learned definition to be complete, a new hypothesis must be learned for covering
it.



Best candidate relation. For the time being, we have chosen a strategy close to
that developed in the system Foil: a divide-and-conquer strategy to cover all the
positive examples and a refinement process to build a good hypothesis covering
as many positive examples as possible and rejecting all the negative examples.
This refinement process is based on different choices: a relation, a binding and a
constraint. To make such choices, different measures have been introduced in
ILP, taking into account the number of positive and negative examples covered.
In our framework, the number of positive and negative generalized tuples that
are covered can be counted. Nevertheless, a generalized tuple can represent an
infinite set of examples, and counting the number of generalized tuples may be
a bad approximation. For the time being, to overcome this problem, we have
implemented a semi-automatic choice procedure:

1. the interest of each relation is computed with the formula

$(n^+ + n^-)/n$

where $n^+$ (resp. $n^-$) denotes the number of positive (respectively negative)
generalized tuples of the target relation that are covered (respectively uncovered)
by the current hypothesis, and $n$ is the total number of generalized
tuples in the target relation,

2. the user chooses among the relations that have a high interest.
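
As a small worked illustration of the interest measure, the function below evaluates it for given counts. The function name and the sample counts are hypothetical; only the formula comes from the text.

```python
def interest(n_pos_covered: int, n_neg_uncovered: int, n_total: int) -> float:
    """Interest of a candidate: (covered positives + uncovered negatives) / total."""
    if n_total <= 0:
        raise ValueError("the target relation must contain at least one tuple")
    return (n_pos_covered + n_neg_uncovered) / n_total


# Hypothetical counts: 5 tuples in the target relation (3 positive, 2 negative).
print(interest(2, 2, 5))  # 0.8
print(interest(3, 0, 5))  # 0.6
```

A candidate that covers fewer positives but rejects every negative can therefore score higher than one that covers all positives and rejects nothing.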

4.2 Efficiency 

Biases. In order to reduce the number of possible variable bindings, two classical
ILP biases have been implemented:

— typing, and

— usage declaration: the system reduces the hypothesis space by using user-supplied
information about the mode of the variables, which is given when
specifying the background relations:

(+) indicates that this position must contain a variable already present in
the relation,

(−) indicates that this position must contain a variable that is not already
present.

For instance, the declaration add(+X : integer, +Y : integer, −Z : integer)
specifies for the generalized relation add over the variables X, Y and Z
that, when introduced in a hypothesis, X and Y must already occur in that
hypothesis.
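
The effect of typing and mode declarations on the number of bindings can be sketched as below. The data layout (a list of mode, name and type triples, and a dictionary of typed hypothesis variables) and the `_new` suffix for fresh variables are assumptions made for this illustration, not the system's actual input format.

```python
from itertools import product


def bindings(declaration, hypothesis_vars):
    """Enumerate the bindings allowed by a mode and type declaration.

    declaration: list of (mode, name, type); hypothesis_vars: dict name -> type.
    A '+' argument must reuse an existing variable of the same type;
    a '-' argument must be a variable that is not yet present.
    """
    names = [name for _, name, _ in declaration]
    choices = []
    for mode, name, typ in declaration:
        if mode == "+":
            choices.append([v for v, t in hypothesis_vars.items() if t == typ])
        else:
            choices.append([f"{name}_new"])
    for combo in product(*choices):
        yield dict(zip(names, combo))


add_decl = [("+", "X", "integer"), ("+", "Y", "integer"), ("-", "Z", "integer")]
hyp = {"A": "integer", "B": "integer", "R": "room"}
result = list(bindings(add_decl, hyp))
print(len(result))  # 4
```

With two integer variables in the hypothesis, only four bindings remain; the variable of type room is never tried.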

Speed-up. For some domains and some kinds of constraints, the efficiency of
the learning system can be improved by normalizing tuples, that is to say by
rewriting tuples into a unique, canonical form. Such a process makes it possible both to
speed up the constraint satisfiability test and to remove redundant constraints.
For instance, a canonical representation of tuples has been proposed for dense
order constraints (dense order inequality constraints are formulas of the form
$x \,\theta\, y$ and $x \,\theta\, c$, where $x$, $y$ are variables, $c$ is a constant, and $\theta$ is one of $=$, $\le$, $<$).
This representation is based on the storage of an upper bound and of a lower
bound for each variable of the generalized tuple.

In our system, numeric and symbolic domains are ordered and each constraint
is associated with an upper and a lower bound. This allows a very fast induction
process for finding interesting thresholds between positive and negative tuples.
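
A canonical form based on bounds can be sketched as follows. This simplified version is an assumption-laden illustration: it handles only constraints between one variable and a constant, treats all inequalities as non-strict, and uses a hypothetical triple format for constraints.

```python
import math


def normalize(constraints):
    """Rewrite a conjunction of bound constraints into one (lower, upper) pair per variable.

    constraints: iterable of (var, op, const) with op in {'<=', '>=', '='}.
    Returns {var: (lower, upper)}, or None when the conjunction is unsatisfiable.
    """
    bounds = {}
    for var, op, c in constraints:
        if op not in ("<=", ">=", "="):
            raise ValueError(f"unsupported operator: {op}")
        lo, hi = bounds.get(var, (-math.inf, math.inf))
        if op in (">=", "="):
            lo = max(lo, c)
        if op in ("<=", "="):
            hi = min(hi, c)
        if lo > hi:
            return None  # empty extension
        bounds[var] = (lo, hi)
    return bounds


print(normalize([("A", "<=", 60.43), ("A", ">=", 0), ("A", "<=", 58.58)]))
# {'A': (0, 58.58)}
print(normalize([("X", ">=", 5), ("X", "<=", 3)]))  # None
```

The redundant constraint on A disappears in the canonical form, and unsatisfiability is detected by a single comparison of the two bounds.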

5 Preliminary Results 

We are currently testing our system on a dataset which had been generated
to test the system GKS, an ILP system developed for the induction of
constraint logic programs.

The dataset provides symbolic and numerical information about the spatial layout
of Japanese houses, e.g. the size of the rooms, their spatial relations or the
total size of the house. The background knowledge is composed of both symbolic
and numeric information.

We have transformed the dataset into a constraint database. For instance,
the relation east-of, which indicates the relative position of two rooms in a house,
is described as follows (a disjunction of conjunctions):

east-of(+house : integer, room1 : room, room2 : room)

[house = 1 $\wedge$ room1 = living-dining $\wedge$ room2 = kitchen]
[house = 1 $\wedge$ room1 = toilet $\wedge$ room2 = bath]
[house = 1 $\wedge$ room1 = entrance $\wedge$ room2 = hall]

The complete database contains about 85000 tuples.

Let us consider, for instance, the predicate corner(H, living, south-east),
meaning that in the house represented by H, the living room is located in
the south-east direction. Such a concept has 32 (tuple) examples, which are all
grounded. At the beginning, the system builds some hypotheses like:

Hypothesis                                                        Number of positive    Number of negative
                                                                  examples covered      examples rejected
south-of($H$, $R_{11}$, $R_{12}$) $\bowtie$ {$R_{12}$ = living}   1                     all
south-of($H$, $R_{11}$, $R_{12}$) $\bowtie$ {$R_{11}$ = living}   16                    4

Since the background knowledge contains relations involving real variables, tuples
are stored in a canonical form expressing the minimum and the maximum values
of each real variable. At any time, the system performs a search for interesting
thresholds discriminating positive from negative examples. This makes it possible to learn
the following rules, which are identical to those induced by the GKS system
(ent-direction represents the direction of the entrance):

Induced rule                                                                      ⊕ covered    ⊖ rejected
area($H$, $A$) $\bowtie$ {$A < 58.58$}                                            3            all
ent-direction($H$, $D$) $\bowtie$ area($H$, $A$) $\bowtie$ {$A < 60.43 \wedge D$ = west}   13           all

We have started a systematic comparison of the induced rules.

6 Conclusion 

We have proposed a new framework for learning, expressed in terms of Constraint
Databases, that generalizes the classical ILP framework in several directions:

— it allows the representation of infinite sets of examples by means of constraints,

— it is expressed in terms of Databases and therefore it can be straightforwardly
applied to Data Mining tasks,

— it can be applied to spatial databases.

Nevertheless, the system that we have developed must be improved: 

The form of the gain. As already explained, our measure has one main drawback: 
it does not take into account the fact that a generalized tuple can cover an infinite 
set of examples. To deal with this problem, new measures must be developed, 
for instance based on the size of the example space, represented by a generalized 
tuple. 

Biases. New biases must be implemented to reduce the search space. 

Induction of constraints. For the time being, the constraints that are induced
are mostly thresholds for numeric and symbolic variables (let us recall that an
order is given even on symbolic domains), and linear inequalities. The system
must be extended to induce other kinds of constraints.

Spatial Databases. The system is currently tested on classical ILP problems. It 
must be tested on learning in Spatial Databases, that cannot be directly handled 
by ILP systems.

Abstract. A method for finding the areas with the highest risk of near-future
earthquakes, from data on observed past earthquakes, has been
desired. The presented Fatal Fault Finder (F³) finds risky active faults
by applying KeyGraph, which was presented as a document indexing algorithm,
to a sequence of focal faults of earthquakes instead of a document.
This strategy of F³ is supported by analogies between a document
and an earthquake sequence: the occurrences of words in a document
and of earthquakes in a sequence have common causal structures, and
KeyGraph previously indexed documents by taking advantage of the causal
structure in a document. As a result, risky faults are obtained from an
earthquake sequence by KeyGraph in a manner similar to the way keywords are
obtained from a document. The risky faults empirically obtained by F³
corresponded closely with real earthquake occurrences and seismologists’
risk estimations.



1 Introduction 

An earthquake occurs at an active fault¹, a boundary between two differently moving
areas of land crust, which is stressed by those movements (see Fig. 1). By
trenching survey, i.e. mining a fault for evidence of past earthquakes, we
can approximately estimate the risk, i.e., whether we have a long time before
the next earthquake at the fault. Such direct land-mining is inefficient and
expensive, and, more significantly, cannot distinguish near-future (within a few
tens of years) earthquake risks from far-future ones.

Mining data, instead of land, for finding risky active faults is inexpensive
and efficient. Seismic gaps, areas where no recent earthquakes have occurred but
around which earthquakes have occurred recently, may be regarded as risky [2].
However, a seismic gap may by nature have no
cause of earthquakes. In [3,4], earthquake risks at future moments of an
active fault were estimated from its recorded history of earthquakes. However, for
high-accuracy predictions, interactions among faults should also be considered,
which is quite difficult because underground interactions among faults are hidden
and very complex.

¹ In this paper, troughs (seismic faults deep in the sea) and active faults in shallower
land crust are both simply called active faults or faults.

In this paper, we present a method to find risky active faults where big
earthquakes may occur in the near future, by detecting hidden causalities among
land crust movements and activities of faults. Hereafter, we call our system Fatal
Fault Finder, or F³ for short.

F³ finds risky faults by applying KeyGraph, which was presented as a document
indexing algorithm, not to a document but to a sequence of focal faults²
of past earthquakes. This strategy comes from the idea that KeyGraph is essentially
an algorithm for extracting causal structures from a sequence of events:
finding keywords carrying the assertions of a document is just one application of
KeyGraph. That is, the cause (land crust movements) and the effect (stresses on
active faults, which cause near-future earthquakes) are extracted from an earthquake
sequence by KeyGraph, just as the cause (basic concepts of the author) and
the effect (keywords expressing the author’s assertions) are extracted from a document.

The remainder goes as follows: in section 2, the outline of the mechanism of F³ is
described. Then, the translation from an earthquake sequence to a sequence of
focal faults is shown in section 3. Applying KeyGraph to the sequence thus obtained
is shown in section 4, and its performance in finding risky faults from real
earthquake data in Japan is presented in section 5.

2 The Outline of Fatal Fault Finder 

The Fatal Fault Finder (F³) is a system following a three-step procedure:

1) Accept the following input data:

Data 1: 2-dimensional locations of faults, on the land surface.

Data 2: A sequence of earthquakes, i.e., the time, the place, the depth, and
the magnitude of each past earthquake.

2) Make D, the sequence of focal faults of earthquakes in Data 2, divided by
periods (“.”) corresponding to the moments of major changes in the land
crust movement, using Data 1 (details are in section 3).

3) Obtain strongly-stressed active faults from D with KeyGraph, and regard
those faults as risky (details are in section 4).

In the next two sections, let us show the details of the procedure above.
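
Step 2 of the procedure can be sketched as a single pass over the time-ordered earthquakes. The sketch assumes that the focal fault of each earthquake has already been determined, that a period is appended after every earthquake whose magnitude exceeds a fixed threshold, and that each token is written as the fault number followed by “#”; the function name, the parameter `m_threshold` and the sample magnitudes are all hypothetical.

```python
def focal_fault_sequence(events, m_threshold):
    """Build D from time-ordered (fault_number, magnitude) pairs.

    Each earthquake contributes one token; a '.' is appended after every
    earthquake stronger than m_threshold to mark a change of period.
    """
    tokens = []
    for fault, magnitude in events:
        token = f"{fault}#"
        if magnitude > m_threshold:
            token += "."
        tokens.append(token)
    return " ".join(tokens)


# Hypothetical magnitudes; only the fault numbers echo the style of the paper's example.
events = [(123, 3.1), (202, 2.8), (84, 5.6), (1, 3.0), (76, 2.9), (1, 5.2)]
print(focal_fault_sequence(events, m_threshold=5.0))
# 123# 202# 84#. 1# 76# 1#.
```

The resulting string has the shape of a document whose sentences are the periods between strong earthquakes, which is what allows a document indexing algorithm to be applied to it.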

3 Obtain a Sequence of Focal Faults from Data 2 

Fig. 1. Land crust areas pressing boundary faults: Small faults shown with thin
lines quake with the movements of their belonging areas, and thick-lined faults
are stressed between forces (thick arrows) under the areas.

² Following the term focus as the underground point where an earthquake occurred,
we say a focal fault to mean the fault where an earthquake occurred.

Data 2 in section 2 is made of $E_i$ defined as $(time_i, longitude_i, latitude_i, depth_i,
magnitude_i)$, for $i = 1, 2, \ldots, N$ sorted by $time_i$ (time of the $i$-th earthquake),
where $N$ is the number of observed earthquakes. $(longitude_i, latitude_i)$, $depth_i$,
and $magnitude_i$ denote the 2-dimensional place, the depth, and the magnitude of
the $i$-th earthquake respectively. The sequence of the focal faults of earthquakes
in Data 2 is obtained by the following two steps.

1) Obtain the focal fault of each earthquake: 

For selecting the focal fault of each earthquake, we use a prepared datum
A_p = (a, b) for every fault A_p in Data 1 (in section 2), where a and b are
the positions of the two ends of A_p, defined 2-dimensionally by the longitude
and the latitude. The third dimension, i.e., the depth of each earthquake,
is not used because it is not known seismologically where each fault lies
underground. We define the distance from the epicenter x (the 2-dimensional
position on the ground surface of the earthquake focus) to a fault A_p
(= (a, b)) in Eq. (1), and regard A_p as the focal fault if distance(x, a, b)
is the least among all the faults in Data 1. Eq. (1) is formed to reflect the
fact that the epicenters of earthquakes occurring at a long fault vary over a
wide area around the fault line, which is the part of a fault that appears on
the ground surface. Note that x, a, and b are two-dimensional vectors.

distance(x, a, b) = |x - a| + |x - b| - |a - b|.   (1)

2) Insert a “.” after each earthquake stronger than M_θ:

Here, M_θ is a fixed magnitude value. Energetic earthquakes become the
causes of major changes in the movement of land crust, by releasing
crust-distortion energy. Therefore, the interval between two nearest “.”s
is a term during which the forces moving the land crust do not change
radically.

For example, from Data 2 in Table 1 we obtain a string sequence D with spaces
and periods (“.”s), as in Eq. (2), including N “#”s.

D = 123# 202# 84#. 1# 76# 1# 216# 202# 84#. 249# 84#. 76# 249# 1# ...   (2)

Here, “m#” means the m-th fault. For instance, 123# and 202# are faults No. 123
and No. 202 prepared in Data 1, which are the closest ones to the epicenters of
the first and the second earthquakes, i.e., (142.804E, 42.140N) and (139.523E,
37.441N), respectively. This example includes three “.”s, supposing that three
big earthquakes over magnitude M_θ occurred. The uncertainty in the time of
earthquake occurrences influences D little, because D retains only the order
of time_i.
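
The two steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the paper: the function names, the dictionary layout of the inputs, and the toy coordinates are all assumptions, and longitude and latitude are treated as plain planar coordinates for simplicity.

```python
import math


def distance(x, a, b):
    """Eq. (1): |x - a| + |x - b| - |a - b| for 2-D points x, a, b."""
    return math.dist(x, a) + math.dist(x, b) - math.dist(a, b)


def focal_fault(epicenter, faults):
    """Return the id of the fault with the least distance to the epicenter.

    faults maps a fault id to its two end points (a, b).
    """
    return min(faults, key=lambda fid: distance(epicenter, *faults[fid]))


def focal_sequence(quakes, faults, m_theta):
    """Build D: one 'id#' token per earthquake in time order, with a '.'
    appended after each earthquake stronger than m_theta."""
    tokens = []
    for q in sorted(quakes, key=lambda q: q["time"]):
        token = f"{focal_fault((q['lon'], q['lat']), faults)}#"
        if q["mag"] > m_theta:
            token += "."
        tokens.append(token)
    return " ".join(tokens)


# Hypothetical toy data: two horizontal faults and three earthquakes.
toy_faults = {1: ((0.0, 0.0), (2.0, 0.0)), 2: ((0.0, 5.0), (2.0, 5.0))}
toy_quakes = [
    {"time": 1, "lon": 1.0, "lat": 1.0, "mag": 2.0},
    {"time": 2, "lon": 1.0, "lat": 4.0, "mag": 5.0},
    {"time": 3, "lon": 1.0, "lat": 0.5, "mag": 3.0},
]
print(focal_sequence(toy_quakes, toy_faults, m_theta=4.0))
```

Because Eq. (1) is zero for any point on the segment between a and b, an epicenter anywhere along a long fault line is treated as equally close to that fault, which matches the motivation given for the formula.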



Table 1. An example of Data 2

  i   time_i                 longitude_i   latitude_i   depth_i   magnitude_i
  1   85 07 01 01:17:52.24   142.804E      42.140N      10.2      2.1
  2   85 07 01 03:01:49.92   139.523E      37.441N      158.7     3.1
  ...
  N   85 07 01 19:51:53.45   138.469E      36.992N      12.9      2.2



4 Applying KeyGraph to Earthquake Predictions 

4.1 Earthquake Mechanisms Considered 

The sources of earthquakes are underground forces, each causing an area of land
crust to move, and the faults in the area to quake with the land crust
movements. Assuming that earthquakes in the same term (i.e., between a pair of
nearest “.”s) are caused under the same set of forces (as in step 2) of
section 3), it is quite possible that faults which often co-quake, i.e., quake
in the same terms, are moved by the same force. In other words, they are quite
possibly in the same land crust area, which is moved by a common force.

Especially when such land crust areas move fast (so that the faults in them
quake frequently), the stress from those areas on the boundary fault tends to
cause a big earthquake in the near future (see Fig. 1). In fact, active fault
provinces are separated by boundaries, some of which are risky active faults,
including the Nojima fault (see section 5). These provinces are the land crust
areas we mean, except that the positions of their boundaries are fixed. The
stress here causes the faults in the land crust areas and the boundary fault
to co-quake before the big earthquake.

4.2 Applying KeyGraph to a Focal Fault Sequence 

Reflecting the earthquake mechanism above, we apply the following steps 1), 2),
and 3) to the focal-fault sequence D obtained as in section 3. In the overall
process, a set of faults that appears as a fixed sequence in D more than twice
is dealt with as one fault rather than as separate faults (see the caption of
Fig. 4).

1) Regard co-quaking faults as existing in the same land crust area:

Graph G is first made of the black nodes in Fig. 2, representing a fixed
number (set to 30, following KeyGraph, to which we refer later) of the most
frequently appearing faults in D. These frequently quaking faults are taken
to be in fast-moving land crust areas. Then, a fixed number (set to 29) of the
most co-quaking fault pairs, i.e., faults w_i and w_j of the largest
coq(w_i, w_j) in Eq. (3), are linked by solid lines in Fig. 2.

coq{w^,Wj) = ( 3 ) 

seD 

where |x|_s denotes the count of x’s appearances in term s.




Fig. 2. Faults in a land crust area are obtained as a cluster (g_a or g_b),
which stresses fault w_0 via the thick dotted arrows.



Then, a maximal connected subgraph (a connected subgraph included in no other
connected subgraph) of G is regarded as a cluster of faults in a land crust
area. For example, two clusters are obtained from D in Eq. (2), one including
the fault set {84#, 202#} and the other including {76#, 1#}.

2) Estimate the stress to a fault from an adjacent land crust area: 

The strength with which a fault w is stressed by the land crust area for
cluster g, i.e., the area including the faults in g, is defined by
stress(w, g) in Eq. (4). Eq. (4) means the co-quakes of the faults in cluster
g with w, divided by the co-quakes of the faults in g with all the faults in
D, for discounting earthquakes caused by the stress between g and faults other
than w.

stress(w, g) = \frac{\sum_{s \in D} |w|_s \, |g - w|_s}
                    {\sum_{s \in D} \sum_{w' \in s} |w'|_s \, |g - w'|_s}   (4)

Here, |g|_s is the count of faults in g quaking in term s, and

|g - w|_s = \begin{cases}
              |g|_s - |w|_s & \text{if } w \in g, \\
              |g|_s         & \text{if } w \notin g.
            \end{cases}   (5)

For example, from D in Eq. (2), 249# is stressed strongly by the two land
crust areas (one with {84#, 202#} and another with {76#, 1#}).

3) Regard faults stressed strongly by land crust areas as risky: Faults
(w’s) of the highest key(w), defined as the sum of stress(w, g) over all
clusters (g’s) in G, are obtained as the most risky faults, as w_0 in Fig. 2.
This is because these faults are stressed from adjacent faults, and are
already beginning to quake and thus to be recorded in Data 2. These observed
quakes should be interpreted as a short-term pre-process leading to a big
earthquake. 249# is taken as risky from D in Eq. (2).
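
Steps 1) to 3) can be sketched as follows on the example sequence of Eq. (2). This is a minimal sketch under stated assumptions rather than the F3 implementation: the function names are hypothetical, the co-quake count is assumed to sum min(|w_i|_s, |w_j|_s) over terms (the product of the two counts would be another possible reading), stress follows Eq. (4) and Eq. (5) as written above, and the node and link counts are reduced from 30 and 29 to values that suit a toy sequence.

```python
from collections import Counter
from itertools import combinations


def parse_terms(d):
    """Split a sequence such as '1 2. 3 1.' into one Counter per term."""
    return [Counter(t.split()) for t in d.split(".") if t.strip()]


def coq(terms, wi, wj):
    """Co-quake count; the min() summand is an assumption of this sketch."""
    return sum(min(s[wi], s[wj]) for s in terms)


def find_clusters(terms, n_nodes, n_links):
    """Step 1: link the most co-quaking pairs among the most frequent faults
    and return the connected components as clusters."""
    freq = Counter()
    for s in terms:
        freq.update(s)
    nodes = [w for w, _ in freq.most_common(n_nodes)]
    pairs = sorted(combinations(sorted(nodes), 2),
                   key=lambda p: coq(terms, *p), reverse=True)[:n_links]
    parent = {w: w for w in nodes}

    def root(w):
        while parent[w] != w:
            w = parent[w]
        return w

    for a, b in pairs:
        parent[root(a)] = root(b)
    groups = {}
    for w in nodes:
        groups.setdefault(root(w), set()).add(w)
    return list(groups.values())


def g_minus_w(s, g, w):
    """Eq. (5): quakes of cluster g in term s, excluding w itself if w is in g."""
    total = sum(s[v] for v in g)
    return total - s[w] if w in g else total


def stress(terms, w, g):
    """Step 2, Eq. (4): co-quakes of g with w over co-quakes of g with all faults."""
    num = sum(s[w] * g_minus_w(s, g, w) for s in terms)
    den = sum(s[v] * g_minus_w(s, g, v) for s in terms for v in s)
    return num / den if den else 0.0


def key(terms, w, clusters):
    """Step 3: sum of the stresses on w from all clusters."""
    return sum(stress(terms, w, g) for g in clusters)


terms = parse_terms("123 202 84. 1 76 1 216 202 84. 249 84. 76 249 1.")
clusters = find_clusters(terms, n_nodes=5, n_links=2)
print(clusters)
print(stress(terms, "249", {"84", "202"}), stress(terms, "249", {"1", "76"}))
```

With these toy parameters the clusters {84, 202} and {76, 1} of the example are recovered, and fault 249 receives stress from both of them. The ranking of faults by key depends on the chosen node and link counts, so the sketch should not be read as reproducing the results of the paper.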

The procedure above can be obtained from KeyGraph, a method for extracting
keywords representing the author’s assertions in a document, by the
translations from the left to the right in Table 2. The strategy of KeyGraph,
i.e., extracting the author’s new assertions supported by his/her basic
concepts, is inherited by F3, which finds new earthquake risks at faults
stressed by basic land crust forces. Just as it catches the content structure
of a document based on the co-occurrences of words in the document, KeyGraph
catches the causal structure of land crusts’ and faults’ activities based on
the co-quakes of faults in F3.



Table 2. Analogy of a document and an earthquake sequence

       Document                              Earthquakes
  (1)  a word                                the focal fault of an earthquake
  (2)  a period (“.”): the end of            the moment of a major change in
       a sentence                            the movements of land crusts
  (3)  a basic concept of the author         the common force in an area of
                                             land crust
  (4)  Words co-occur for describing         Faults co-quake by common
       a component basic concept.            underground forces.
  (5)  Basic concepts support an             Stresses from land crusts cause
       assertion, which KeyGraph             a fault to quake.
       extracts as keywords.


Note that the co-occurrence of words in KeyGraph is defined simply by the
number of occurrences of a pair of words in the same sentences, yet KeyGraph
achieved good performance owing to its modeling of the structure of a
document’s content. A rough model of the underlying structure, as in KeyGraph,
is helpful for dealing with earthquakes, because the details of the
underground interactions of faults are unknown (even adjacent faults can be
ruled by different underground forces), as are the interactions among basic
concepts in a document author’s mind.

5 Experimental Evaluations 

We tested F3, implemented on a DOS/V machine with a 200 MHz Pentium Pro CPU
and 128 MB of memory. We used 311 representative active faults in Japan as
Data 1, and earthquake data with contents as in Table 1 as Data 2.

We set M_θ to 4.0, for which F3 worked best. For a smaller M_θ, too many “.”s
in the focal fault sequence divided graph G (in section 4) into clusters too
small to correspond to any meaningful area of land crust. For a larger M_θ,
the clusters were too large.




Fig. 3. The Kansai area. The lines with numbers are active faults. Numbers in 
black angles are referred to in Fig. 4. 



First, let us show an example of the performance of F3 for the Kansai area of
Japan, of longitude from E132.0 to E136.0 and of latitude from N32.5 to N36.0
(the whole area in Fig. 3). Here, Data 2 was taken from the Japanese University
Network Earthquake Catalogue (JUNEC), and consisted of the earthquakes of
magnitude M > 2 in this area in 1985. In Fig. 3, active faults are shown by
the numbered lines.

Fig. 4 is the output of F3 when Data 2 consisted of the earthquakes of 1985 in
the Kansai area. Here, active fault No. 39 was obtained as risky. No. 39 is the
Nojima fault, at which the Southern Hyogo earthquake of M7.2 occurred in 1995
(more than 6000 people were killed). This fault was also selected as the most
risky from the sequence of earthquakes from 1985 to 1994 in the Kansai area in
JUNEC. We can say F3 predicted the Southern Hyogo earthquake, because Data 2
for this case was recorded before that earthquake. The median tectonic line
((A) in Fig. 6), one of the riskiest in Japan, was also obtained as risky.

Fig. 4. The result of F3 for the Kansai area. F3 obtained the double-circle
nodes as risky faults. Thick solid lines and black nodes form clusters (some
clusters include only one node, i.e., one fault), and other lines show stresses
from clusters to faults. Thick dotted lines are the strongest stresses. A set
of faults appearing as a fixed sequence in D more than twice appears here as
one fault, e.g., “25, 76” and “56, 73.”



Another point in Fig. 4 is that we can see the mechanism by which a risky
fault is stressed. For example, fault No. 39 is stressed by No. 24 and 34 (each
node forming one cluster) and by the large bold-lined cluster (of faults such
as No. 76, 79, etc.). From Fig. 3, the result in Fig. 4 is interpreted to mean
that No. 39 was stressed between its northern (No. 24 and 34) and southern
crust areas (No. 76, 79, etc.), as seismologists widely believe.

Comparing Fig. 4 with Fig. 5 shows the reader the correspondence between
applying KeyGraph to a focal fault sequence and applying it to a document.
Fig. 5 depicts the basic concepts (solid lines and black nodes), the supports
of assertions by basic concepts (dotted lines), and the assertions (white
double circles) in a document, obtained by KeyGraph. “Speed,” “integer
programming,” “approximate method,” and so on, which were asserted in the
document, were selected in spite of their low frequencies (they appeared only
once or twice in the entire document of 6899 words). The document was a paper
presenting a method of speeding up predicate-logic abductive inference (an
NP-complete problem) by applying an approximate method of integer programming.

The cluster in the center of Fig. 5 shows that “networked bubble propagation”
(“NBP”), which had been presented as a method of speeding up propositional
abduction, was a basic concept in the paper, as were “solution,” “time,”
“hypotheses,” etc. in black nodes. Building on these ideas, the paper asserted
that an “approximate method” of “integer programming” can be a tool for
speeding up predicate-logic abduction, not only propositional abduction.





Fig. 5. A result of KeyGraph: Here the white double-circle nodes are the
assertions in the document, corresponding to risky faults in F3. Black nodes
are in clusters, i.e., basic concepts (some clusters consist of one word, e.g.,
“inference,” “hypotheses”), corresponding to faults in one area of land crust
in F3. Multiple words in a cluster are connected with solid links, as are
faults in F3. Note that basic concepts strongly supporting assertions are also
depicted as double-circle (black) nodes, because they help in showing the
abstract of the assertion in the case of a document.



5.1 Results of Japanese Risky Faults 

(1) From Data 2 from 1985 to 1992 in JUNEC, the solid-lined faults in
each of the five frames in Fig. 6 were obtained by F3 as the 14 % riskiest
faults in the frame, from Data 2 of the earthquakes that occurred in the framed
area. The output figures of F3 for these frames, exemplified by Fig. 4, showed
that local crust areas smaller than each frame stressed the risky faults. On
the other hand, the faults with dotted lines in Fig. 6 were obtained as the
14 % riskiest faults in Japan, from Data 2 of the earthquakes that occurred
all over Japan. The figures for this case showed that large crust areas stress
the risky faults. This difference in output between the solid and the dotted
lines comes from the locality of the activities, i.e., local (solid lines) and
global (dotted lines) crust activities, rather than from the strength of the
stresses.

In Fig. 6, a thicker line (fault) was obtained as riskier, i.e., it obtained
a larger key value of KeyGraph in F3. The circles in Fig. 6 depict the 10
greatest earthquakes after or near the years in which Data 2 was recorded. The
nearest faults to all (100 %) of these epicenters were selected by F3 as risky.
We restricted the data to earthquakes before 1992 here, to show that F3 made
predictions.

(2) From Data 2 from 1985 to 1992 in the earthquake catalog of JMA
(Japan Meteorological Agency), 89 % of the risky faults obtained by F3 (14 %
selected for each frame and 14 % for all over Japan, as in the JUNEC case
above) corresponded to the faults in Fig. 6. The data of JMA were gathered from
observation points more homogeneously distributed over the whole region of
Japan than those of JUNEC. This correspondence between the results for JMA and
JUNEC, in spite of the difference in the way the data were gathered, shows that
the causal structure of earthquakes is reflected in KeyGraph regardless of the
quality of the data.

Faults in the intersection of the outputs in (1) and (2) included the 12
riskiest faults (areas (A)-(L) in Fig. 6) to which seismologists pay attention.
Further, accepting all the results of F3 except one probable error pointed out
by staff at ERI (Earthquake Research Institute of Univ. Tokyo), the precision
of the risky faults obtained by F3 is 98 % (50/51). The “only probable error”
here is shown by the thick dotted arrow with “X” in Fig. 6. The sea-side fault
in Fukushima prefecture, at the arrowhead, was taken as risky although
seismologists regard this area as safe. The reason for this failure of F3 was
that some earthquakes whose epicenters are near this fault on the land surface
were taken as caused by this fault. In fact, they occurred at the trough (an
active area deep in the sea) at the tail of the arrow: this trough continues
deep underground beneath Fukushima prefecture.

It may be possible to reduce such failures by restricting our target to
shallow active faults only, so as to ignore deep earthquakes at troughs.
However, with this approach, we have to give up some good results. For example,
area (L) in Fig. 6 (the South-Kanto area) is said by seismologists to be likely
to quake greatly, and the trough in area (L) was judged risky by F3. If we
ignore deep earthquakes, the earthquakes in area (L) will be thrown away from
Data 2 and (L) will be eliminated from the risky areas of F3. To avoid both
errors, we have to consider the crust structures at the deep level under the
ground. Unfortunately, however, the deep underground structure of most faults
is unknown.

From Data 2 after 1993 of JMA (JUNEC has no data after 1993), the results of
F3 changed mainly in the Kansai area. The Nojima fault ((B) in Fig. 6) was the
riskiest in Kansai until 1995. After 1995, the year of the Southern Hyogo
earthquake, faults in the northern extension of the Nojima fault (e.g., No. 2
at the right edge of Fig. 3) came to be obtained as risky. We cannot refer to
past earthquakes to evaluate this result, because F3 may be showing future
risks, but in fact the southern Hanaore fault (No. 2) has long been stressed.







Fig. 6. Risky faults obtained by F3 from the data of each framed area (solid
lines) and of all over Japan (dotted lines). The shadowed areas are estimated
to be risky according to seismologists.



6 Conclusions 

F3 obtained more focused predictions of earthquakes (focused on risky faults
rather than on areas including risky faults, and on near-future risks rather
than on far-future risks) than the previous methods introduced in section 1.
The accuracy of our results therefore cannot be compared directly with that of
other existing methods, but it shows the remarkable performance of F3. These
results are meaningful in that a simple computing algorithm, based on a simple
model of causal structures as in KeyGraph, worked well for finding useful
information from superficial data on complex natural events.

We are not brave enough to say that F3 always predicts near-future great
earthquakes, because the model saying “an earthquake sequence is like a
document” in using KeyGraph may be much simpler than the real mechanism of
unknown land crust movements. However, it is safe to recommend F3 as a helpful
tool for aiding risk estimation, e.g., for focusing seismologists’ attention
on the faults which F3 regards as risky from easily stored electronic data.
Abstract. This paper describes a new method of discovering clusters
of related Web pages. By clustering Web pages and visualizing them in
the form of a graph, users can easily access related pages. Since related
Web pages are often referred to from the same Web page, the number of
co-occurrences of references in a search engine is used for discovering
relations among pages. Two URLs are given to a search engine as keywords,
and the number of pages found for both URLs divided by the number of pages
found for either URL, which is called the Jaccard coefficient, is calculated
as the criterion for evaluating the relation between the two URLs. The value
is used for deciding the length of an edge in a graph so that the vertices of
related pages will be located close to each other. Our system based on the
method succeeds in discovering clusters of various genres, although the
system does not interpret the contents of the pages. The Jaccard coefficient
is easily calculated by computer systems, and the method is suitable for
discovery from data acquired through the Internet.



1 Introduction 

In order to acquire useful information from the vast network of the World Wide
Web, it is important to clarify the relations among Web pages. A set of Web
pages can be represented as a graph when each Web page is regarded as a vertex
and each hyperlink as an edge. This paper describes a method of discovering
clusters of related Web pages based on the co-occurrence of references. Two
URLs are given to a search engine as keywords, and the number of pages found
for both URLs divided by the number of pages found for either URL, which is
called the Jaccard coefficient, is calculated as the criterion for evaluating
the relation between the two URLs. This is because related Web pages are often
referred to from the same Web page. In order to locate related Web pages close
to each other on a graph, the Jaccard coefficient is also used for deciding
the length of edges. Our system based on the Jaccard coefficient generates a
graph which shows the relations among a set of Web pages. From the generated
graph, several clusters of closely related pages are easily discovered, such
as the clusters of Linux, music, and travel. Evaluation of relations by the
Jaccard coefficient can be regarded as a variation of collaborative
information filtering. From an abundance of data acquired through the
Internet, the system succeeds in discovering the clusters of pages without
interpreting their contents.



2 Related Work 

It is often pointed out that visualization is important for detecting
relations among objects and for extracting implicit information. In the domain
of the WWW, a graph representation of Web pages is expected to assist the
discovery of related pages. Inxight exhibits a system which visualizes the
directories of a Web site as a hyperbolic tree. The system is also used in
AltaVista Discovery, which assists users’ search for Web pages. Homepagemap is
another example of visualization using a hyperbolic tree. Although these
systems clarify the relations among directories of the same Web site, they
cannot show the relations among Web pages of different Web sites.
Visualization by graph structure is employed by Natto View, which allows its
users to “lift up” an arbitrary vertex in a graph. When a vertex is lifted up,
the other vertices linked with it also go upward so that the users can observe
the connections among vertices. However, the system is inappropriate for
grouping related Web pages on a graph, since vertices can be moved upward
only.

In general, Web pages contain several forms of information such as documents,
images, and music. Understanding all such contents and classifying the pages
appropriately is not easy even for humans. In order to achieve such
intelligent activities with a computer system, it is very important to utilize
the referring relations of hyperlinks. For detecting such relations among Web
pages, a search engine is a powerful tool. As an example of a visualization
system using a search engine, Shibayama’s system filters out unreferred pages
from the search results for a few keywords and shows the remaining pages in
the form of a graph. However, the length of an edge in the graph does not
reflect the strength of the relation. Edges connecting closely related
vertices should be short so that the connected vertices will be located close
to each other.

Humans often use visualized representations for extracting implicit
information and for controlling reasoning. Visualized information has
properties that assist humans’ intelligent activities: related data are often
grouped together in a figure (locality), and a figure often enables the
detection of visual information such as neighborhood relations and relative
size (emergent property). These properties are mentioned in much research, and
they are part of the reason humans use figures for reasoning and problem
solving. Sumi’s Communication Support System [15] is one example that utilizes
the above properties. The system generates a map of words on a two-dimensional
plane. In the map, strongly related words are placed close together, which
facilitates communication among users. In the domain of the WWW also, by
locating related Web pages close together on a graph, users can access related
Web pages easily and acquire needed information efficiently.

3 A Method of Discovery Based on the Co-occurrence of References

The structure of the system described in this paper is shown in figure 1. The
input to the system is the URLs of the user’s favorite Web pages in the form
of a bookmark file, and the output from the system is a graph showing the
relations among the Web pages. The system performs the following four tasks:

— Acquisition of HTML files 

— Extraction of Hyperlinks 

— Calculation of Jaccard Coefficient 

— Graph Drawing 



[Figure: a bookmark file is input to the visualization system, which accesses
the URLs on the Internet to obtain HTML files of Web pages, queries a search
engine for the number of references, and outputs a graph.]







Fig. 1. Overall structure of our system 



3.1 Acquisition of HTML Files 

From the input bookmark file, the system accesses the URLs included in the
file in order to acquire the corresponding HTML files. If an included URL ends
with “/”, such as “http://www.asahi.com/”, the index.html file is acquired
from the URL.

3.2 Extracting Hyperlinks 

From the acquired HTML files, the system extracts hyperlinks, which are divided
into two types:

— href=“URL”

(Example: <a href=“http://www.yahoo.co.jp”>)

— href=“filename”

(Example: <a href=“/english/english.html”>)

The latter type of hyperlink is used mainly for specifying files in the same
Web site. Since our system aims at visualizing pages from different Web sites,
the former type of hyperlink is extracted for graph drawing.
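
As an illustration of this filtering step, the following sketch keeps only hyperlinks whose href is a full URL and drops site-local file names. It uses Python's standard html.parser and urllib.parse modules rather than the Java code of the system described here; the class and function names are hypothetical, and treating "has an http or https scheme" as the test for the first type is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class AbsoluteLinkParser(HTMLParser):
    """Collect href values of <a> tags that are full http(s) URLs."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Site-local links such as "/english/english.html" have no scheme.
            if name == "href" and value and urlparse(value).scheme in ("http", "https"):
                self.links.append(value)


def extract_absolute_links(html):
    parser = AbsoluteLinkParser()
    parser.feed(html)
    return parser.links


sample = ('<a href="http://www.yahoo.co.jp">Yahoo</a>'
          '<a href="/english/english.html">English</a>')
print(extract_absolute_links(sample))
```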



3.3 Calculation of Jaccard Coefficient 

In order to decide the length of edges in a graph, a method similar to
REFERRAL’s technique is employed in this paper. REFERRAL visualizes
researchers’ social networks. The system judges the strength of the relation
between two researchers by the Jaccard coefficient, which is the number of
pages indexed by AltaVista that contain both names divided by the number of
pages that contain either name. REFERRAL draws a graph of the relations whose
Jaccard coefficient exceeds a threshold value. This method utilizes the number
of co-occurrences of references in a search engine as the criterion for
evaluating the relation between researchers. The method is easily processed by
computers, and is applicable to various domains.

In our method, the Jaccard coefficient of two URLs is defined as the number of
pages indexed by a search engine that contain both URLs divided by the number
of pages that contain either URL. Although REFERRAL uses the Jaccard
coefficient just for removing unimportant relations, our method calculates the
length of edges from the value of the Jaccard coefficient in order to show the
strength of relations on a graph. The length of an edge is defined as the
reciprocal of the Jaccard coefficient. For example, suppose the number of pages
indexed by a search engine that contain www.asahi.com is 19797, the number of
pages that contain www.nikkei.co.jp is 43365, and the number of pages that
contain both is 1773. The Jaccard coefficient of www.asahi.com and
www.nikkei.co.jp is 1773/(19797+43365), and the length of the edge connecting
the two URLs is (19797+43365)/1773. As the search engine for calculating the
Jaccard coefficient, Goo (http://www.goo.ne.jp) is employed in this paper.
Since such robot-based search engines hold a large amount of data, they are
suitable for searching for co-occurrences of URLs.
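
The arithmetic of this example can be checked with a short sketch. The function names are hypothetical. The first function follows the worked example in the text, whose denominator is the sum of the two individual hit counts; the second shows the set-theoretic Jaccard coefficient, in which the pages containing both URLs are counted only once in the denominator. Which of the two the system computes internally is not stated beyond the example, so the second function is offered only for comparison.

```python
def jaccard_as_in_example(n_a, n_b, n_both):
    """Coefficient as computed in the worked example: both / (a + b)."""
    return n_both / (n_a + n_b)


def jaccard_set_based(n_a, n_b, n_both):
    """Set-theoretic Jaccard coefficient: |A and B| / |A or B|."""
    return n_both / (n_a + n_b - n_both)


def edge_length(coefficient):
    """Edge length is the reciprocal of the coefficient."""
    return 1.0 / coefficient


# Hit counts quoted in the text for www.asahi.com and www.nikkei.co.jp.
j = jaccard_as_in_example(19797, 43365, 1773)
print(round(j, 4), round(edge_length(j), 2))
print(round(jaccard_set_based(19797, 43365, 1773), 4))
```

Either way the coefficient for this pair is close to 0.028, so the edge is about 35 units long and the pair clears the 0.01 threshold that the experiments use for generating an edge.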



3.4 Graph Drawing 

As the method of drawing a graph from the data of vertices and edges,
Force-Directed Placement is employed in this paper. The basic idea of this
method is to replace each vertex with a steel ring and each edge with a spring
to form a mechanical system. The vertices are placed in some initial layout
and let go so that the spring forces on the rings move the system to a minimal
energy state. Repulsive forces are calculated between pairs of neighboring
vertices, and attractive forces are calculated between pairs of vertices which
are connected by an edge. Each vertex is moved in the direction of the applied
resultant force until all the vertices become stable.

Let us suppose that the length of each edge e_j is len(e_j) and that the edge
e_j expands or contracts toward its optimum length l_j. The strength of the
repulsive or attractive force from an edge e_j on a vertex v_i is defined as
follows:

|f_e(v_i, e_j)| = k_1 |len(e_j) - l_j|

The sum of the forces from all the edges connected to a vertex v_i is the
resultant force from edges applied to the vertex v_i:

F_e(v_i) = \sum_j f_e(v_i, e_j)

To speed up the force calculation, repulsive forces are neglected between
pairs of vertices whose distance is d or more. Suppose the distance between
vertices v_i and v_j is dist(v_i, v_j). The strength of the repulsive force
applied to both vertices is defined as follows:

|f_v(v_i, v_j)| = \begin{cases}
                    random              & (dist(v_i, v_j) = 0) \\
                    k_2 / dist(v_i, v_j) & (0 < dist(v_i, v_j) < d) \\
                    0                   & (d \le dist(v_i, v_j))
                  \end{cases}

The sum of the forces from all other vertices v_j (j \ne i) is the resultant
force from vertices applied to a vertex v_i:

F_v(v_i) = \sum_{j \ne i} f_v(v_i, v_j)

The sum of the two forces above,

F(v_i) = F_e(v_i) + F_v(v_i),

is the force applied to a vertex v_i. Each vertex is moved in the direction of
the applied force until the graph becomes stable.
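
The force model above can be sketched as a simple iterative layout routine. This is an illustrative sketch, not the system's Java implementation: the function name, the default constants k1, k2, d and step, and the use of a fixed number of iterations in place of a stability test are all assumptions made here.

```python
import math
import random


def layout(pos, lengths, k1=0.5, k2=1.0, d=5.0, step=0.1, iters=500, seed=0):
    """pos: {vertex: (x, y)}; lengths: {(u, v): optimum edge length l_j}."""
    rng = random.Random(seed)
    pos = {v: [float(x), float(y)] for v, (x, y) in pos.items()}
    names = list(pos)
    for _ in range(iters):
        force = {v: [0.0, 0.0] for v in names}
        # Spring force of each edge, with magnitude k1 * |len(e_j) - l_j|.
        for (u, v), l in lengths.items():
            dx, dy = pos[v][0] - pos[u][0], pos[v][1] - pos[u][1]
            dist = math.hypot(dx, dy)
            if dist == 0:
                continue  # direction undefined; the repulsion below separates them
            f = k1 * (dist - l)  # positive pulls the ends together
            force[u][0] += f * dx / dist
            force[u][1] += f * dy / dist
            force[v][0] -= f * dx / dist
            force[v][1] -= f * dy / dist
        # Repulsive force between vertex pairs closer than d.
        for i, u in enumerate(names):
            for v in names[i + 1:]:
                dx, dy = pos[v][0] - pos[u][0], pos[v][1] - pos[u][1]
                dist = math.hypot(dx, dy)
                if dist >= d:
                    continue
                if dist == 0:
                    angle = rng.uniform(0, 2 * math.pi)  # random push
                    fx, fy = k2 * math.cos(angle), k2 * math.sin(angle)
                else:
                    fx, fy = k2 / dist * dx / dist, k2 / dist * dy / dist
                force[u][0] -= fx
                force[u][1] -= fy
                force[v][0] += fx
                force[v][1] += fy
        for v in names:
            pos[v][0] += step * force[v][0]
            pos[v][1] += step * force[v][1]
    return {v: tuple(p) for v, p in pos.items()}


# With repulsion switched off, one edge settles at its optimum length.
out = layout({"a": (0, 0), "b": (1, 0)}, {("a", "b"): 2.0}, k2=0.0)
print(math.dist(out["a"], out["b"]))
```

With the optimum length l_j set to the reciprocal of the Jaccard coefficient, strongly related pages end up joined by short springs and therefore close together in the drawing.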

4 Experimental Results 

Our system was developed in Java (JDK 1.1.6) according to the method described
above. As the input bookmark files for the system, the following collections
of representative Web pages are used:

— Award sites of CSJ Index (Japanese) 

(http://www.csj.co.jp/whatsbest/award.html)

— Best 100 sites by ZDNet (Japanese) 

(http://www.zdnet.co.jp/internet/yig/best100/index.html)

— Sites from 100hot.com 

(http://www.100hot.com/home.chtml)

Fig. 2. Visualization using Jaccard coefficient



Figure 2 shows the result of the visualization of the CSJ award pages using
the Jaccard coefficient. From the figure, the following clusters of related
Web pages are discovered:

A cluster of Linux pages (the lower right of the figure) 

— www.linux.or.jp 

— www.redhat.com 

— www.debian.org 

— www.debian.or.jp 

— www.pht.co.jp

A cluster of fashion pages (the upper center of the figure) 

— www.isokai.co.jp 

— triumphjapan.com 

— www.aubade.com 

A cluster of travel pages (the middle right of the figure) 

— www.arukikata.co.jp 

— www.appleworld.com 

A cluster of music pages (the upper left of the figure) 

— www.komuro.com 

— www.amuro.com 

— www.tomomi.com

— www.tktour.com

— www.area-globe.com 

In the figure, clusters of Web pages of Apple site and Yahoo site are also 
found. From the graph of ZDNet pages, the following clusters are discovered: 

A cluster of weather pages 

— www.kishou.go.jp 

— tenki.or.jp 

— www.jwa.or.jp 

A cluster of the pages of related company or organization 

— www.sony.co.jp 

— www.scei.co.jp

— www.sme.co.jp

— www.japan-music.or.jp 

From the graph of 100hot pages, the following clusters are discovered:

A cluster of news pages 

— www.washingtonpost.com 

— dailynews.yahoo.com

— www.nypostonline.com

— www.abcnews.com

— www.newsweek.com

A cluster of Java pages

— www.javaworld.com 

— java.sun.com

— www.jars.com

— www.sun.com





          # of Web pages   # of vertices   # of edges   # of search
CSJ       165              152             219          2109
ZDNet     104              142             223          2514
100hot    129              180             160          2360

Table 1. Data of Generated Graphs



If a system tries to interpret the contents of the Web pages, it is not easy
to discover such clusters. This successful result shows that calculating the
Jaccard coefficient using a search engine is quite effective in discovering
relations among pages. The numbers of Web pages, vertices, edges, and searches
used to generate each graph are shown in Table 1. The distribution of the
number of vertices in each cluster is shown in Tables 2, 3, and 4. The average
number of vertices per cluster is 4.3 in the CSJ graph, 5.3 in the ZDNet graph,
and 4.5 in the 100hot graph. These sizes are appropriate for browsing related
Web pages.
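
The averages quoted above can be checked directly from the cluster-size
distributions in Tables 2, 3, and 4. The following is a small sketch of that
arithmetic; the dictionary layout and the function name `summarize` are
illustrative choices, not part of the original system.

```python
# Cluster-size distributions copied from Tables 2-4:
# {number of vertices in a cluster: number of such clusters}
distributions = {
    "CSJ": {2: 18, 3: 7, 4: 2, 7: 1, 8: 1, 9: 1, 10: 2, 12: 1, 13: 1, 18: 1},
    "ZDNet": {2: 8, 3: 11, 4: 3, 5: 2, 8: 1, 10: 1, 53: 1},
    "100hot": {2: 20, 3: 6, 4: 3, 5: 4, 6: 1, 8: 2, 10: 1, 17: 2, 24: 1},
}

def summarize(dist):
    """Return (total vertices, total clusters, average vertices per cluster)."""
    vertices = sum(size * count for size, count in dist.items())
    clusters = sum(dist.values())
    return vertices, clusters, vertices / clusters

for name, dist in distributions.items():
    vertices, clusters, average = summarize(dist)
    print(f"{name}: {vertices} vertices in {clusters} clusters, average {average:.1f}")
```

The vertex totals computed this way (152, 142, and 180) agree with the
"# of vertices" column of Table 1.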
As shown in Table 3, a rather large cluster (53 vertices) is also found in the
generated graph. This is because the “meaning” of the edges differs within
the cluster: some edges in the cluster represent the relation among PC vendors,
while other edges represent the relation among PC magazines. Our system
generates an edge when the Jaccard coefficient of two URLs is greater than or
equal to 0.01. By changing this threshold value, our system can change the size
of the clusters in a graph.
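
As a minimal sketch of the edge rule described above, the code below computes
the Jaccard coefficient from page counts and groups URLs into clusters
(connected components) under the 0.01 threshold. It assumes that, for each URL,
a search engine returns the number of pages referring to it, and for each pair
the number of pages referring to both; the hit counts and the function names
are hypothetical and chosen only for illustration.

```python
from itertools import combinations

def jaccard(hits_a, hits_b, hits_both):
    """Jaccard coefficient |A and B| / |A or B| computed from page counts."""
    union = hits_a + hits_b - hits_both
    return hits_both / union if union > 0 else 0.0

def build_clusters(hits, co_hits, threshold=0.01):
    """Create an edge when the coefficient is >= threshold; return edges and clusters."""
    parent = {url: url for url in hits}          # union-find forest

    def find(url):
        while parent[url] != url:
            parent[url] = parent[parent[url]]    # path halving
            url = parent[url]
        return url

    edges = []
    for a, b in combinations(sorted(hits), 2):
        both = co_hits.get((a, b), 0)            # keys are alphabetically ordered pairs
        if jaccard(hits[a], hits[b], both) >= threshold:
            edges.append((a, b))
            parent[find(a)] = find(b)

    groups = {}
    for url in hits:
        groups.setdefault(find(url), set()).add(url)
    return edges, [g for g in groups.values() if len(g) > 1]

# Hypothetical hit counts, not measured values.
hits = {"www.debian.org": 900, "www.redhat.com": 1200, "www.komuro.com": 300}
co_hits = {("www.debian.org", "www.redhat.com"): 150}
edges, clusters = build_clusters(hits, co_hits)
print(edges, clusters)
```

Raising `threshold` removes weak edges and splits large clusters; lowering it
merges them, which is the size control mentioned in the text.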



# of vertices    2   3   4   7   8   9  10  12  13  18
# of clusters   18   7   2   1   1   1   2   1   1   1

Table 2. Distribution of the number of nodes in each cluster of CSJ graph



# of vertices    2   3   4   5   8  10  53
# of clusters    8  11   3   2   1   1   1

Table 3. Distribution of the number of nodes in each cluster of ZDNet graph



# of vertices    2   3   4   5   6   8  10  17  24
# of clusters   20   6   3   4   1   2   1   2   1

Table 4. Distribution of the number of nodes in each cluster of 100hot graph



In previous studies of Web visualization, several attempts have been made to
draw graphs in three-dimensional space. Compared with a two-dimensional graph,
more vertices can be represented in a three-dimensional graph. However, it is
often pointed out that users have difficulty understanding the overall
structure of a huge three-dimensional graph. In the experiments described
above, the number of vertices generated in each graph is about 150. This size
is considered moderate when the graph is used for finding clusters of related
Web pages and for understanding the tendency of all URLs in the input bookmark
file.

5 Discussion 

5.1 Co-occurrence and Semantic Relation 

In this paper, the number of co-occurrences of references in a search engine is
regarded as the criterion for evaluating the relation between two URLs. One
related work using such a criterion is Craven’s ILP system. A predicate
has_neighborhood_word, which represents the co-occurrence of hyperlinks and
their neighboring words in Web documents, is included in the system’s
background knowledge.

In general, few Web pages contain hyperlinks to unrelated URLs. On the 
other hand, related URLs often co-occur in the pages of link collection. It is 
probable that the more two URLs are related, the more they co-occur in Web 
pages. Therefore, it is appropriate to reflect the value of Jaccard coefficient on 
the length of edge in a graph. 

5.2 Information Filtering 

The process of selecting important hyperlinks based on Jaccard coefficient is re- 
garded as information filtering. There are two main paradigms|3 of information 
filtering for a given user: content-based filtering which tries to recommend items 
similar to those the given user has liked in the past, and collaborative filtering 
which identifies users whose tastes are similar to those of the given user and 
recommends items they have liked. 

Since the method using the Jaccard coefficient performs filtering based on the
relevance judged by each Web page builder, the method is regarded as a variation
of collaborative filtering. Our filtering method is based on the relevance of
Web pages rather than the similarity of tastes, and it is not applicable to
pages that are not referred to from any other pages. However, our method has an
advantage over ordinary information filtering methods: no input data is
required in advance for investigating users’ tastes. This advantage can be used
to offset the disadvantages of other filtering methods. Therefore, our method
is regarded as a new approach to Web page filtering.

6 Conclusion 

A new method of discovering clusters of related Web pages is described in this 
paper. The method of evaluating relation of pages based on Jaccard coefficient 
is simple and powerful. The visualization system in this paper is important also 
for clarifying the role of diagrammatic representation for humans’ intelligent 
activities, such as learning and reasoning. 

The author has been working on discovery systems for mathematical theorems.
These systems draw figures and observe them in order to acquire the data needed
for their discovery. In the domain of mathematics, discovery systems can
generate and acquire the needed data by themselves. The same approach is
available for discovery in other domains, because many systems nowadays can
access external resources through the internet. By searching for URLs using an
external search engine, the system described in this paper performs a kind of
“experimentation” to acquire the data needed for discovery. Our method based on
the Jaccard coefficient is a new approach to discovering knowledge from the
internet. It is expected that data acquisition through the internet will enable
the discovery of knowledge in real-world domains.
Abstract. This article addresses the problem of modeling and exploring
large scale time series data and space-time data with mean value
structure. A smoothness priors modeling approach is taken and applied
to POS and GPS data. In this approach, the observed series are
decomposed into several components, each of which is expressed by a
smoothness priors model. In the analysis of POS and GPS data, various
useful information was extracted by this decomposition, resulting in
some discoveries in these areas.



1 Introduction 

In statistical information processing, the introduction of the information
criterion AIC made it possible to compare statistical models freely and changed
the conventional paradigm of statistical research, which consisted of
estimation and statistical testing. It revealed the importance of proper
statistical modeling, and the use of parametric models has become very popular
since then. The AIC criterion suggests that if the available data are short, we
have to use a simpler model to obtain reliable information from the data.
However, with the progress of various measuring devices, it has become possible
to use huge amounts of data in various fields of science and society. In this
situation, a more important problem is to extract useful information from a
huge amount of data, which is difficult to achieve with a simple parametric
model. Namely, modeling with a small number of parameters is sometimes
insufficient, and a more flexible tool for extracting useful information from
data is necessary.

In an analysis of the input-output relationship of econometric time series,
Shiller introduced the notion of “smoothness priors” and considered a
constrained least squares problem. A similar concept had already appeared in
earlier work addressing the problem of estimating a smooth trend. The trade-off
parameters were determined subjectively until Akaike proposed a method of
choosing the priors (trade-off parameters), or hyperparameters in a Bayesian
framework, by maximizing the likelihood of a Bayes model. The calculation of
the likelihood of the model requires intensive computation, a burden that
Gersch and Kitagawa eased by employing a state space representation of the
model and the recursive algorithm of Kalman filtering.
In this paper, we present applications of this smoothness priors approach
for exploring large scale time series data or space-time data. Specifically, we
consider POS (Point of Sales scanner) data and GPS (Global Positioning System)
data, because the automatic processing of these data is one of the most
attractive and promising targets in statistical science. The analyses of these
data show that, by removing trend and seasonal components through proper
smoothness prior modeling, useful information such as competitive relations
(for POS data) and local fluctuations associated with atmospheric conditions
(for GPS data) can be discovered.



2 Smoothness Prior Modeling 



2.1 Flexible (Semi-parametric) Modeling 

A smoothing approach attributed to Whittaker is as follows. Let

\[ y_n = f_n + \varepsilon_n, \qquad n = 1, \ldots, N, \tag{1} \]

denote observations, where $f_n$ is an unknown smooth function and
$\varepsilon_n$ is an independently identically distributed (i.i.d.) normal
random variable with zero mean and unknown variance $\sigma^2$. The problem is
to estimate $f_n$, $n = 1, \ldots, N$, from the observations $y_n$,
$n = 1, \ldots, N$, in a statistically sensible way. Here the number of
parameters to be estimated is equal to the number of observations, so ordinary
least squares or maximum likelihood methods yield meaningless results.
Whittaker suggested that the solution $f_n$, $n = 1, \ldots, N$, balances a
tradeoff between infidelity to the data and infidelity to a $k$th-order
difference equation constraint. Namely, for fixed values of $\lambda^2$ and
$k$, the solution satisfies



\[ \min_f \left[ \sum_{n=1}^{N} (y_n - f_n)^2 + \lambda^2 \sum_{n=1}^{N} (\Delta^k f_n)^2 \right]. \tag{2} \]



The first term in the brackets in (2) is the infidelity-to-the-data measure, the
second is the infidelity-to-the-constraint measure, and $\lambda^2$ is the
smoothness tradeoff parameter. Whittaker left the choice of $\lambda^2$ to the
investigator.
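
For a fixed tradeoff parameter, the penalized least squares problem above has a
closed-form solution: writing the $k$th-order differences as a matrix $D$, the
minimizer solves $(I + \lambda^2 D'D) f = y$. The sketch below illustrates this
with NumPy. It is not the authors' implementation; it assumes the penalty is
taken over the $N - k$ differences that can be formed from the data alone (no
initial values), and the function name `whittaker_smooth` is hypothetical.

```python
import numpy as np

def whittaker_smooth(y, lam2, k=2):
    """Minimize sum (y_n - f_n)^2 + lam2 * sum (Delta^k f_n)^2 over f."""
    y = np.asarray(y, dtype=float)
    n_obs = len(y)
    # Rows of D apply the k-th order difference operator; shape (n_obs - k, n_obs).
    D = np.diff(np.eye(n_obs), n=k, axis=0)
    # Normal equations of the penalized least squares problem.
    return np.linalg.solve(np.eye(n_obs) + lam2 * D.T @ D, y)

# Usage: smooth a noisy curve with a subjectively chosen tradeoff parameter.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0 * np.pi, 100)
noisy = np.sin(x) + 0.3 * rng.standard_normal(100)
smooth = whittaker_smooth(noisy, lam2=50.0, k=2)
```

With `lam2 = 0` the data are reproduced exactly; as `lam2` grows, the estimate
approaches a polynomial of order $k - 1$, which is why the choice of the
tradeoff parameter matters.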



2.2 Automatic Parameter Determination via Bayesian 
Interpretation 

A smoothness priors solution explicitly solves the problem posed by Whittaker.
A version of the solution is as follows: multiply (2) by $-1/(2\sigma^2)$ and
exponentiate it. Then the solution that minimizes (2) achieves the maximization
of



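\[ \ell(f) = \exp\left\{ -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - f_n)^2 \right\} \exp\left\{ -\frac{\lambda^2}{2\sigma^2} \sum_{n=1}^{N} (\Delta^k f_n)^2 \right\}. \tag{3} \]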
Under the assumption of normality, (3) yields a Bayesian interpretation

\[ \pi(f \mid y, \lambda^2, \sigma^2, k) \propto p(y \mid \sigma^2, f)\, \pi(f \mid \lambda^2, \sigma^2, k), \tag{4} \]

where $\pi(f \mid \lambda^2, \sigma^2, k)$ is the prior distribution of $f$,
$p(y \mid \sigma^2, f)$ the data distribution, conditional on $\sigma^2$ and
$f$, and $\pi(f \mid y, \lambda^2, \sigma^2, k)$ the posterior of $f$. Akaike
obtained the marginal likelihood for $\lambda^2$ and $k$ by integrating the
right-hand side of (4) with respect to $f$. This facilitates an automatic
determination of the tradeoff parameters in constrained least squares, which
had been treated subjectively for many years, and eventually led to the
frequent use of Bayesian methods in the statistical and information science
communities. Several interesting applications of this method
can be seen in g]. 

2.3 Time Series Interpretation and State Space Modeling 

Consider the problem of fitting a polynomial of order $k - 1$ defined by

\[ y_n = t_n + \varepsilon_n, \qquad t_n = a_0 + a_1 n + \cdots + a_{k-1} n^{k-1}, \tag{5} \]

where $\varepsilon_n \sim N(0, \sigma^2)$. It is easy to see that this
polynomial is the solution to the difference equation

\[ \Delta^k t_n = 0, \tag{6} \]

with appropriately defined initial conditions. This suggests that by modifying
the above difference equation so that it allows for a small deviation from the
equation, namely by assuming $\Delta^k t_n \approx 0$, it might be possible to
obtain a more flexible regression curve than the usual polynomials. A possible
formal expression is the stochastic difference equation model

\[ \Delta^k t_n = v_n, \tag{7} \]

where $v_n \sim N(0, \tau^2)$ is an i.i.d. Gaussian white noise sequence. For
small noise variance $\tau^2$, it reasonably expresses our expectation that the
noise is mostly very “small” and with a small probability it may take a
relatively “large” value. Actually, the solution to the model is, at least
locally, very close to a $(k-1)$th order polynomial. However, globally a
significant difference arises and it can express a very flexible function. For
$k = 1$, it is locally constant and becomes the well-known random walk model,
$t_n = t_{n-1} + v_n$. For $k = 2$, the model becomes
$t_n = 2t_{n-1} - t_{n-2} + v_n$ and the solution is a locally linear function.
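
The two special cases above follow from expanding the difference operator,
$\Delta t_n = t_n - t_{n-1}$. As a short worked step (standard algebra, added
here only for illustration), the general expansion and its first two instances
are:

```latex
\Delta^k t_n = \sum_{j=0}^{k} (-1)^j \binom{k}{j}\, t_{n-j},
\qquad
\begin{aligned}
k = 1:&\quad t_n - t_{n-1} = v_n
        \;\Longrightarrow\; t_n = t_{n-1} + v_n, \\
k = 2:&\quad t_n - 2t_{n-1} + t_{n-2} = v_n
        \;\Longrightarrow\; t_n = 2t_{n-1} - t_{n-2} + v_n .
\end{aligned}
```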

The model (7) together with (5) can be expressed in a special form of the
state space model

\[ \begin{aligned} x_n &= F x_{n-1} + G v_n && \text{(system model)} \\ y_n &= H x_n + w_n && \text{(observation model)}, \end{aligned} \tag{8} \]

where $v_n \sim N(0, \tau^2)$, $w_n \sim N(0, \sigma^2)$,
$x_n = (t_n, \ldots, t_{n-k+1})'$ is a $k$-dimensional state vector, and $F$,
$G$ and $H$ are $k \times k$, $k \times 1$ and $1 \times k$ matrices,
respectively. For example, for $k = 2$, they are given by
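\[ x_n = \begin{bmatrix} t_n \\ t_{n-1} \end{bmatrix}, \quad F = \begin{bmatrix} 2 & -1 \\ 1 & 0 \end{bmatrix}, \quad G = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \quad H = [1, 0]. \tag{9} \]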



One merit of this state space representation is that we can use the
computationally efficient Kalman filter for state estimation. Since the state
vector contains the unknown trend component, estimating the state vector $x_n$
automatically yields the trend. Unknown parameters of the model, such as the
variances $\tau^2$ and $\sigma^2$, can also be estimated by the maximum
likelihood method. In general, the likelihood of the time series model is given
by



\[ L(\theta) = \prod_{n=1}^{N} p(y_n \mid Y_{n-1}, \theta), \tag{10} \]



where $Y_{n-1} = \{y_1, \ldots, y_{n-1}\}$ and each component
$p(y_n \mid Y_{n-1}, \theta)$ can be obtained as a byproduct of the Kalman
filter. It is interesting to note that the tradeoff parameter $\lambda^2$ in
the penalized least squares method (2) can be interpreted in terms of the ratio
of the system noise variance to the observation noise variance, or the
signal-to-noise ratio: $\lambda^2 = \sigma^2 / \tau^2$ is the reciprocal of
$\tau^2 / \sigma^2$.

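The individual terms in (10) are given, in the general $\ell$-variate case, by

\[ p(y_n \mid Y_{n-1}, \theta) = (2\pi)^{-\ell/2}\, |W_{n|n-1}|^{-1/2} \exp\left\{ -\tfrac{1}{2}\, e_{n|n-1}'\, W_{n|n-1}^{-1}\, e_{n|n-1} \right\}, \tag{11} \]

where $e_{n|n-1} = y_n - y_{n|n-1}$ is the one-step-ahead prediction error of
the time series, and $y_{n|n-1}$ and $W_{n|n-1}$ are the mean and the variance
covariance matrix of the observation $y_n$, respectively, defined by

\[ y_{n|n-1} = H_n x_{n|n-1}, \tag{12} \]

\[ W_{n|n-1} = H_n V_{n|n-1} H_n' + R_n. \tag{13} \]

Here $x_{n|n-1}$ and $V_{n|n-1}$ are the mean and the variance covariance
matrix of the state vector given the observations $Y_{n-1}$, which can be
obtained by the Kalman filter, and $R_n$ is the variance covariance matrix of
the observation noise ($\sigma^2$ in (8)).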
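
To make the likelihood computation concrete, the following is a minimal sketch
of a Kalman filter for the scalar trend model with $k = 2$, accumulating the
log of the one-step-ahead predictive densities. It is an illustration under
stated assumptions, not the authors' code: the initial state mean and the large
initial covariance are arbitrary choices, missing observations are assumed to
be coded as NaN and handled by skipping the update step, and the function name
`trend_loglik` is hypothetical.

```python
import numpy as np

def trend_loglik(y, tau2, sigma2):
    """Kalman filter for t_n = 2 t_{n-1} - t_{n-2} + v_n, y_n = t_n + w_n.

    Returns the log-likelihood and the filtered trend estimates.
    """
    F = np.array([[2.0, -1.0], [1.0, 0.0]])
    G = np.array([[1.0], [0.0]])
    H = np.array([[1.0, 0.0]])
    x = np.zeros((2, 1))        # assumed initial state mean
    V = 1.0e4 * np.eye(2)       # assumed large (diffuse-like) initial covariance
    loglik, trend = 0.0, []
    for yn in y:
        # One-step-ahead prediction of the state.
        x = F @ x
        V = F @ V @ F.T + tau2 * (G @ G.T)
        if np.isnan(yn):
            # Missing observation: keep the prediction, skip the update.
            trend.append(x[0, 0])
            continue
        e = yn - (H @ x)[0, 0]                  # prediction error
        w = (H @ V @ H.T)[0, 0] + sigma2        # variance of the prediction error
        loglik += -0.5 * (np.log(2.0 * np.pi * w) + e * e / w)
        K = V @ H.T / w                         # Kalman gain
        x = x + K * e
        V = V - K @ H @ V
        trend.append(x[0, 0])
    return loglik, np.array(trend)
```

Maximizing the returned log-likelihood over `tau2` and `sigma2`, for example by
a grid search, gives the automatic choice of the tradeoff parameter that the
penalized least squares formulation leaves to the investigator.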



2.4 Modeling of Space-Time Data 

Let $Z_n^i$ ($n = 1, \ldots, N$; $i = 1, \ldots, I$) be a scalar observation at
discrete time $n$ for station (site) $i$. Along the lines mentioned above, we
consider the following model to decompose $Z_n^i$ into a trend, $T_n^i$, and an
irregular component, $D_n^i$, namely,

\[ Z_n^i = T_n^i + D_n^i. \tag{14} \]

A direct approach to realize the Bayesian space-time model is given by consid- 
ering the following system model for each n 

T; = 2 + El,, El, ~ iV(0, for V *, (15) 

K-n = K, K-^1V(0,(</>(Z\*^))V) forV(*,j), (16) 
where $\Delta^{i,j}$ is some measure of the distance between stations $i$ and
$j$, and $\phi$ is usually assumed to be a linear function truncated at
$\Delta_{th}$, which is set to the mean distance between neighboring points.
Although this approach is desirable from the statistical viewpoint, its
numerical realization is impractical because of the large memory required for
the large number of stations, $I \approx 1{,}000$, that we usually deal with.
For a lower dimensional model such as $I < 100$, a simple approach that treats
$\mathbf{T}_n = [T_n^1, T_n^2, \ldots, T_n^I, T_{n-1}^1, \ldots, T_{n-1}^I]'$
as a state vector can be implemented on a computer with large memory.

A simple way to mitigate this computational difficulty in the direct Bayesian
approach for a case with $I \approx 1{,}000$ is to assume that the time series
$Z^i = [Z_1^i, Z_2^i, \ldots, Z_N^i]'$ are mutually independent vectors. This
assumption allows the smoothness priors approach mentioned earlier to be
employed. Then we use the system model given by (15) only. The maximum
likelihood estimates for $\sigma^{2,i}$ and $\tau^{2,i}$ are denoted by
$\hat{\sigma}^{2,i}$ and $\hat{\tau}^{2,i}$, respectively. The Kalman filter
and smoother with $\hat{\sigma}^{2,i}$ and $\hat{\tau}^{2,i}$ yield the
estimates for the trend component, $T_n^i$. The estimated irregular component
$D_n^i$ is called the residual hereafter. A vector of the residual components
for station $i$ is denoted by $D^i$, and the median of $T_n^i$ over the
stations for each $n$ by $T_n$. Similarly, the percentile points of $T_n^i$
corresponding to the $\pm\sigma$ and $\pm 2\sigma$ intervals are determined for
each $n$.

The next step for exploring the mean structure of the space-time data is
to examine the spatial correlation of the residual components in terms of the
correlation coefficient $C^{i,j}$ between $D^i$ and $D^j$. For a fixed station
$i$, the spatial distribution of $C^{i,j}$ as a function of the distance
measure $\Delta^{i,j}$ has to be examined visually. In practice, the large
number of stations $I$ hampers this kind of visual examination. Therefore, a
plot of $C^{i,j}$ versus $\Delta^{i,j}$ guides us to further improvements of
the mean structure of the space-time data. Obviously, when many points with
high correlation appear at small values of $\Delta^{i,j}$, taking the spatial
correlation into account would improve the initial estimate of the mean
structure of the space-time data, $T_n^i$. Such improvements can be realized by
considering the following smoothness prior model for spatial data

= + K^N{0,r^) forV * . . 

^A^-^< = VY, forV(*,j-) ^ ’ 

where is an improved trend component. The iterative procedure mentioned 
above is practical for improving the estimates of the mean structure of the space- 
time data set j^. 
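
The correlation-versus-distance examination described in this subsection can be
sketched as follows. The code computes the correlation coefficient for every
pair of residual series together with the distance between the corresponding
stations, ready to be drawn as a scatter plot. The array layout, the use of a
plain Euclidean distance on station coordinates, and the function name
`correlation_vs_distance` are assumptions made for illustration.

```python
import numpy as np

def correlation_vs_distance(residuals, coords):
    """Return (distance, correlation) for every station pair, each pair once.

    residuals: array of shape (I, N), one residual series per station.
    coords:    array of shape (I, 2), station positions (e.g. in degrees).
    """
    residuals = np.asarray(residuals, dtype=float)
    coords = np.asarray(coords, dtype=float)
    corr = np.corrcoef(residuals)                     # corr[i, j] between stations i and j
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))          # assumed Euclidean distance
    i, j = np.triu_indices(len(coords), k=1)          # upper triangle: i < j
    return dist[i, j], corr[i, j]

# Tiny synthetic example with three stations.
res = np.array([[1.0, 2.0, 3.0, 4.0],
                [2.0, 4.0, 6.0, 8.0],
                [4.0, 3.0, 2.0, 1.0]])
xy = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]])
distances, correlations = correlation_vs_distance(res, xy)
```

For $I$ stations this yields $I(I-1)/2$ points, so with many stations a random
subsample may be needed before plotting.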

3 Applications 

3.1 Seasonal Adjustment, Earth Tide, and Groundwater 

The smoothness priors method has been applied to many real world problems.
Most economic time series contain trend and almost periodic components, which
make it difficult to capture the essential changes in economic activities.
Therefore in economic data analysis, removal of these effects is important, and
it is realized by the decomposition

\[ y_n = t_n + s_n + w_n, \tag{18} \]

where $t_n$, $s_n$ and $w_n$ are the trend, seasonal and irregular components.
A practical solution to this decomposition was given by the use of smoothness
priors for both $t_n$ and $s_n$.

Similar decomposition methods have been developed for the analysis of earth
tide data and groundwater data, where the time series is decomposed as

\[ y_n = t_n + p_n + e_n + r_n + w_n, \tag{19} \]

where $p_n$, $e_n$ and $r_n$ are the barometric air pressure effect, the earth
tide effect and the precipitation effect, respectively. By decomposing 10 years
of groundwater data with this model, the effects of earthquakes are clearly
detected, and various findings on the relation between the occurrence of
earthquakes and the groundwater level are obtained.



3.2 Analysis of POS Data 

Analysis of Point-of-Sales (POS) scanner data is an important research area of
“data mining” and discovery science, which may provide store managers with
useful information for controlling the prices or stock levels of goods.
Measurements of the response to price changes and semi-automatic sales
forecasts for each brand may be useful for pursuing price promotions
efficiently and reducing the risk of “dead-stock” or “out-of-stock”.

A POS data set consists of a huge number of items, and the analyses so far have
mostly concentrated on the detection of mutual relations between items. In this
subsection, we show that, by smoothness prior modeling of multivariate time
series that takes into account various components such as the long term
baseline sales trend, the day-of-the-week effect and competitive effects, it is
possible to discover the effect of temporary price-cuts and the competitive
relations between several items.

Assume that $y_n = [y_n^{(1)}, \ldots, y_n^{(\ell)}]'$ denotes the
$\ell$-dimensional time series of sales of a certain product category, and
$p_n = [p_n^{(1)}, \ldots, p_n^{(\ell)}]'$ is the covariate expressing the
price of each brand. The generic model we consider here for the analysis of
POS data is given by

\[ y_n = t_n + d_n + x_n + w_n, \tag{20} \]

where $t_n$, $d_n$, $x_n$ and $w_n$ are the baseline sales trend,
day-of-the-week effect, sales promotion effect and observation noise. Each
component of the baseline sales trend, $t_n^{(j)}$, is assumed to follow the
first order trend model

\[ t_n^{(j)} = t_{n-1}^{(j)} + v_n^{(j)}. \tag{21} \]
The day-of-the-week effect, $d_n^{(j)}$, can be considered as a special form of
seasonal component with period length 7 and is assumed to follow

= + + (22) 

The price promotion effect is assumed to be expressed by a linear function of 
nonlinear transformation of the price (price function) 

\[ x_n = B_n f(p_n). \tag{23} \]

In the analysis that follows, we assume that the price function is given by 

= exp {- 7 (n - no)} I a [^P^d^ ~ (24) 

where $\Delta p_n^{(j)}$ denotes the temporary price-cut from its regular
(precisely, the maximum) price, $\gamma$ is a parameter, $n_0$ is the starting
point of the price-cut, and $I(\cdot)$ is an indicator function of the
condition that a price-cut is effective in causing a sales increase. In actual
modeling, this price promotion effect is further decomposed into
$x_n = g_n + z_n$, where $g_n$ is the category expansion effect, which
corresponds to the contribution to the increase of total sales. On the other
hand, $z_n$ is the brand switch effect, which is the increase in the sales of a
brand obtained at the expense of a decrease in other brands and does not
contribute to the increase of the category total.

This model can be conveniently expressed as a linear state space model, and thus
the numerically efficient Kalman filter can be used for state estimation,
namely for the decomposition into components, and for parameter estimation.
Among the various possible candidate models, the best model was found by the
AIC criterion.

The presented model was applied to scanner data sets of the daily milk
category for the period 1994/2/28 - 1996/3/3 ($N = 735$). A five-variate series
consisting of the top four brands and the total of the others was analyzed.
Only two brands, B1 and B2, are shown at the top of Fig. 1. The second plots
show the estimated baseline trend components plus the estimated
day-of-the-week effects. Only about 20% of the variation of the original series
is explained by this day-of-the-week effect. However, for other stores where
the prices of brands did not change so significantly, the day-of-the-week
effect contributes much more than in the present case.

Fig. 2 shows the detected competitive relations between the four major brands
obtained via the identified model. It can be seen that B3 is independent of the
other brands’ price promotions. A price-cut of B4 (B2) increases the sales of
B4 (B2), but reduces those of B1 and B2 (B1). A price-cut of B1 increases the
sales of B1 and does not affect the sales of other brands. The third plots of
Fig. 1 show the estimated category expansion effect. The price-cuts of B1 and
B2 contribute to the expansion of the category total. On the other hand, the
bottom plots show the estimated brand switch components. The brand switch
components of B1 and B2 are quite different. The plot for B1 indicates that the
price-cut of B1 contributes only slightly to the expansion of its own sales,
while B1 is vulnerable to the price-cuts of B2 and B4. On the other hand, the
price-cut of B2 contributes considerably to the increase of its own sales, and
B2 is only slightly affected by the price-cut of B4.
Fig. 1. Decomposition of the brands B1 (left) and B2 (right). Top plots: observed
series, second plots: baseline trend plus day-of-the-week component, third plots:
category expansion, bottom plots: brand substitution.




Fig. 2. Competitive relations between four brands. 
3.3 Analysis of GPS Data



GPS (Global Positioning System) data are among the most interesting and
important data sets, allowing us to investigate global changes in the
environment precisely. High precision information on the positions of permanent
stations can be obtained by processing the microwave signals from GPS
satellites. Several physical quantities of the media between the GPS satellites
and the ground stations affect the phase information of the microwave signals
and result in propagation delays. Therefore, a careful treatment of propagation
delays is required to extract reliable information on the measured positions.

The dominant sources of propagation delays are of (1) ionospheric origin and
(2) tropospheric origin, such as atmospheric pressure and atmospheric water
vapor. The propagation delay generated by atmospheric water vapor, called the
wet delay, is the most difficult to evaluate among these factors. Good
estimates of the propagation delay can be obtained for the ionospheric and
atmospheric pressure sources by utilizing other physical quantities measured
simultaneously. As a result, the wet delay appears as a “noise source” in the
processing of the GPS data and has to be subtracted before the GPS data are
examined for information on positions.

In Japan, the Geographical Survey Institute of Japan (GSI) has been making
considerable efforts to establish a nationwide GPS array. The Japanese GPS
array is characterized by its high spatial resolution; the array is composed of
nearly one thousand stations, typically separated by 15-30 km from one another.
Proper processing of the GPS data set that takes the wet delay effect into
account therefore allows us to estimate a high-frequency spatial pattern of the
atmospheric water vapor, in particular the precipitable water vapor (PWV),
which plays an important role in weather forecasting. Indeed, the approach of
extracting information on the PWV from GPS data has drawn much attention in the
field of meteorology and is now referred to as GPS meteorology.

Our objective in this study is to find rules that give a quantitative
description of the relationship between the fluctuations observed in the GPS
data and the PWV, and to make it possible to produce a PWV map with high
spatial resolution. We begin with an analysis of the daily GPS array data
provided by the GSI. Let $U_n^i$ be the observation on the $n$th day, counted
from January 1st, 1996, at station (site) $i$:



U). = K,i;:,Z;]' (*=l,...,/;n = iV:,...,A:) 



(25) 



where $X$, $Y$, and $Z$ correspond to the north-south, east-west, and up-down
components, respectively, the two limits of $n$ represent the first and last
dates of the GPS data available to us now, and $I$ is the number of stations.

Our preparatory analysis shows that the fluctuations associated with the
PWV are most clearly seen in the up-down component among the three components.
In this study we therefore focus on the up-down component $Z_n^i$.
Unfortunately, the original GPS array data contain outliers as well as missing
observations. These unsatisfactory cases can be easily treated by the
smoothness priors approach with the state space model presented in Sect. 2.3,
which provides us with reasonably interpolated data. The interpolation allows
us to determine the median and the percentile points of the trend
systematically. Fig. 3 shows the median and the percentile points obtained by
applying the smoothness priors approach to $Z_n^i$. A seasonal pattern, which
is expected to be associated with the PWV, is clearly seen in this figure. In
addition, the amplitude of the seasonal variation is found to be larger than
the typical amplitude of the residuals, which can be approximated by the mean
of the standard deviations of $D^i$. Therefore, it is apparent that extracting
precise information on position from the GPS array requires eliminating the
effect of the PWV from the GPS data. A power spectrum analysis performed on the
median trend component finds no prominent peak in the frequency domain except
for a yearly cycle. A detailed investigation is under way to discover with what
factors this pattern is associated from the viewpoint of climatology.







Fig. 3. The median, $\pm 1\sigma$, and $\pm 2\sigma$ percentile points of the
estimated trend of the up-down component versus $n$.



Fig. 4 shows a plot of $\Delta^{i,j}$ versus $C^{i,j}$, where the unit of
$\Delta$ is the degree; roughly speaking, a distance of one degree corresponds
to 111 km. In this figure, only 10,000 points randomly drawn from about 180,000
are shown, in order to reduce the file size of the figure. The appearance of
many points with high correlation at small values of $\Delta$ clearly suggests
that a residual sometimes shows a fluctuation similar to those at neighboring
stations.

Three lines superposed on this figure are: 

C{A) = exp (— y) [Exp. decay type] (Thin line)

C{A) = (0.82)"^ [AR type] (Broken line) (26) 

C{A) = 1 — 0.36 • (Z\)'^-^® [Long Memory type] (Thick line). 
The horizontal line indicates a value of $1/e$. Each curve represents a
correlation function induced from a model denoted in bold face. The thin and
thick lines are drawn so as to resemble the envelopes of the upper bound and
the $+2\sigma$ percentile as functions of $\Delta$ in the range $\Delta < 10$.
The good agreement of the thick line with the $+2\sigma$ envelope would imply
that a long-memory-type spatial correlation ($H \approx 0.15$) happens to be
observed for an atmospheric spatial pattern. An examination of the weather maps
for these cases is interesting, but the detailed discussion is left for
elsewhere.




Fig. 4. Plots of $\Delta^{i,j}$ versus $C^{i,j}$.



4 Conclusion 

The key to the success of a statistical procedure is the appropriateness of the
model used in the analysis. Smoothness priors facilitate the development of
various types of models based on prior information on the subject and the data.
In this paper, we applied the smoothness priors method to the modeling of large
scale time series and space-time data with mean value structure and competitive
relations between variables. In the analysis of POS data, the time series is
decomposed into several components, and various insights for making a strategy
concerning price promotion and risk control of dead-stock are obtained. Useful
information for making a conjecture on the relationship between the propagation
delay and the PWV is successfully extracted, based on a detailed investigation
of the trend and residual components.
Abstract. This paper proposes a method for detecting geomagnetic
sudden commencement (SC) from a geomagnetic horizontal (H) component
by using lifting wavelet filters. Lifting wavelet filters are
biorthogonal wavelet filters containing free parameters. Our method is
to learn such free parameters based on training signals which contain
the SC. The learnt wavelet filters have the features of the training
signals. Applying such wavelet filters to test signals, we can detect
the time when SC phenomena occurred.



1 Introduction 

The earth’s magnetosphere is suddenly compressed when it collides with an
interplanetary shock emitted from the sun. A sudden increase of the magnetic
field, called a geomagnetic sudden commencement (SC), is then observed in the
magnetosphere. On the ground it is detected almost simultaneously all over the
world.

Studies of SCs are important because SCs can be used as a probe to study the
transient response of the complex system consisting of the magnetosphere,
ionosphere and electrically conducting earth to a sudden increase of the
dynamic pressure in the solar wind. It is desirable to detect and accumulate as
many SCs as possible in order to make statistical studies of the SC phenomena.
Real time detection of SCs is also important because SCs are frequently
followed by severe geomagnetic storms, during which many hazardous accidents,
such as anomalies of electronic circuits on board satellites and large scale
power line failures in high latitudes, may occur.

The SC in low latitudes is typically observed as a sharp increase of the
geomagnetic horizontal (H) component. This sharp increase can be expressed as
high frequency components obtained by a wavelet decomposition of the
H-component. The wavelet decomposition is carried out by using low and high
pass wavelet filters [5]. Intuitively, it seems that the high frequency
components should enable us to detect the SC phenomena. However, according to
our experience, this is not useful in practice.

In this paper, we present a method for detecting SCs by lifting wavelet filters.
Lifting wavelet filters are biorthogonal wavelet filters including controllable
free parameters [6]. Recently, such wavelet filters were designed so as to make
the high frequency components vanish for the compaction of electrocardiogram
signals [1],[2],[3]. Our method is to learn the free parameters in lifting
wavelet filters so that they capture the features of SCs. Utilizing the learnt
parameters, we can construct adaptive lifting wavelet filters by which it is
possible to detect the time when SC phenomena occurred.

In simulations, we will show results of our trial of automatic detection of 
SCs using the adapting lifting wavelet filter. Here we used the H-components at 
Kakioka station (geomagnetic coordinates: 26.9, 208.3). 

This paper is organized as follows. In Section 2, we describe a wavelet de- 
composition of signals. In Section 3, we survey a lifting wavelet filter. In Section 
4, we develop a detection algorithm. Section 5 shows an automatic detection of 
SCs. We close in Section 6 with concluding remarks and plans for future work. 

2 Wavelet Decomposition of Signals 

Let $c^1_l$ denote a signal with time parameter $l$. Using the multiresolution
analysis of wavelet theory, we can decompose the signal $c^1_l$ into low
frequency and high frequency components as follows:

$$ c^0_m = \sum_l \tilde{\lambda}_{l-2m}\, c^1_l, \tag{1} $$

$$ d^0_m = \sum_l \tilde{\mu}_{l-2m}\, c^1_l, \tag{2} $$

where $\tilde{\lambda}_m$ and $\tilde{\mu}_m$ are called decomposition filters.
Conversely, we can reconstruct the original signal $c^1_l$ from the low
frequency and high frequency components $c^0_m$ and $d^0_m$ by the formula

$$ c^1_l = \sum_m \lambda_{l-2m}\, c^0_m + \sum_m \mu_{l-2m}\, d^0_m, $$

where $\lambda_m$ and $\mu_m$ are called reconstruction filters.

For later convenience, we put

$$ h^{old}_{k,l} = \lambda_{l-2k}, \qquad g^{old}_{m,l} = \mu_{l-2m}, $$

$$ \tilde{h}^{old}_{k,l} = \tilde{\lambda}_{l-2k}, \qquad \tilde{g}^{old}_{m,l} = \tilde{\mu}_{l-2m}, $$

and assume that the tuple of these filters
$\{h^{old}_{k,l}, \tilde{h}^{old}_{k,l}, g^{old}_{m,l}, \tilde{g}^{old}_{m,l}\}$
satisfies the following conditions

$$ \sum_l h^{old}_{k,l}\, \tilde{h}^{old}_{k',l} = \delta_{kk'}, \qquad \sum_l \tilde{g}^{old}_{m,l}\, h^{old}_{k,l} = 0, $$

$$ \sum_l g^{old}_{m,l}\, \tilde{h}^{old}_{k,l} = 0, \qquad \sum_l g^{old}_{m,l}\, \tilde{g}^{old}_{m',l} = \delta_{mm'}, $$

which are called biorthogonal conditions.

Since a geomagnetic SC often occurs as a sharp increase of the H-components,
it may appear in the high frequency components $d^0_m$. As a simple algorithm
for automatic detection of geomagnetic SCs, we can consider finding the time $m$
when the absolute value of $d^0_m$ computed by (2) is large. To examine whether
this simple algorithm works or not, we compute the high frequency components
of the geomagnetic horizontal components at Kakioka station shown in Figure 1.
The time when the SC phenomenon occurred is 10:22, which is indicated by the




Fig. 1. Geomagnetic horizontal components (left) and the high frequencies 
(right). 



arrow in Figure 1 (left). The absolute value of the high frequency component at
this time, however, is not always the largest.

To solve the problem, we propose a new detection algorithm using lifting 
wavelet filters, which will be described in Section 4. 
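
As an illustration of the decomposition (1), (2) and of the reconstruction
formula of this section, the following Python sketch performs one level of
decomposition and reconstruction. It is not taken from the paper: the Haar
filters, the periodic boundary handling and the function names `decompose` and
`reconstruct` are assumptions chosen to keep the example short. The filters
actually used in the experiments are the convolution-type filters of Section 5.1.

```python
import math

# Haar filters (an assumption for illustration). They are orthogonal, so the
# decomposition filters coincide with the reconstruction filters.
R = 1.0 / math.sqrt(2.0)
LAMBDA = [R, R]     # low pass taps at indices 0 and 1
MU = [R, -R]        # high pass taps at indices 0 and 1


def decompose(c1, lam_t=LAMBDA, mu_t=MU):
    """Return (c0, d0) with c0[m] = sum_l lam_t[l - 2m] * c1[l]; d0 uses mu_t."""
    half = len(c1) // 2          # the signal length is assumed to be even
    c0 = [0.0] * half
    d0 = [0.0] * half
    for m in range(half):
        for j in range(len(lam_t)):
            l = (2 * m + j) % len(c1)      # periodic extension (assumption)
            c0[m] += lam_t[j] * c1[l]
            d0[m] += mu_t[j] * c1[l]
    return c0, d0


def reconstruct(c0, d0, lam=LAMBDA, mu=MU):
    """Return c1 with c1[l] = sum_m lam[l - 2m] * c0[m] + mu[l - 2m] * d0[m]."""
    length = 2 * len(c0)
    c1 = [0.0] * length
    for m in range(len(c0)):
        for j in range(len(lam)):
            l = (2 * m + j) % length
            c1[l] += lam[j] * c0[m] + mu[j] * d0[m]
    return c1


# A toy signal with a sharp increase between l = 2 and l = 3.
signal = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0, 5.0]
c0, d0 = decompose(signal)
restored = reconstruct(c0, d0)
largest = max(range(len(d0)), key=lambda m: abs(d0[m]))
```

In this toy case the only nonzero high frequency component sits at the pair that
straddles the jump, which is the behaviour the simple detection rule relies on.
On real H-component data other disturbances also produce large components, which
is the difficulty addressed in Section 4.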

3 Lifting Wavelet Filter 

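Lifting wavelet filters are biorthogonal wavelet filters with controllable
free parameters, constructed from initial biorthogonal wavelet filters. Here
we choose the old biorthogonal wavelet filters
$\{h^{old}_{k,l}, \tilde{h}^{old}_{k,l}, g^{old}_{m,l}, \tilde{g}^{old}_{m,l}\}$
as the initial biorthogonal wavelet filters.

We construct a new set of biorthogonal wavelet filters
$\{h_{k,l}, \tilde{h}_{k,l}, g_{m,l}, \tilde{g}_{m,l}\}$ as follows:

$$
\begin{aligned}
h_{k,l} &= h^{old}_{k,l} + \sum_m \tilde{s}_{k,m}\, g^{old}_{m,l}, \\
\tilde{h}_{k,l} &= \tilde{h}^{old}_{k,l}, \\
g_{m,l} &= g^{old}_{m,l}, \\
\tilde{g}_{m,l} &= \tilde{g}^{old}_{m,l} - \sum_k \tilde{s}_{k,m}\, \tilde{h}^{old}_{k,l},
\end{aligned}
\tag{3}
$$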



Automatic Detection of Geomagnetic Sudden Commencement 



245 



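where $\tilde{s}_{k,m}$ denote free parameters, $\tilde{h}_{k,l}$ and
$\tilde{g}_{m,l}$ indicate the low pass and high pass decomposition filters, and
$h_{k,l}$ and $g_{m,l}$ indicate the low pass and high pass reconstruction
filters, respectively. It can be proved that the new biorthogonal
wavelet filters satisfy the following biorthogonal conditions:

$$ \sum_l h_{k,l}\, \tilde{h}_{k',l} = \delta_{kk'}, \qquad \sum_l \tilde{g}_{m,l}\, h_{k,l} = 0, $$

$$ \sum_l g_{m,l}\, \tilde{h}_{k,l} = 0, \qquad \sum_l g_{m,l}\, \tilde{g}_{m',l} = \delta_{mm'}. $$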
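
The claim that the lifted filters of (3) remain biorthogonal for any choice of
the free parameters can be checked numerically. The sketch below is only an
illustration under assumptions: it uses periodic Haar filters as the old filters
(so the decomposition and reconstruction filters coincide), stores each filter
as a matrix row, and picks arbitrary values for the free parameters. The helper
names `haar_rows`, `lift` and `is_biorthogonal` are hypothetical.

```python
import math


def haar_rows(n_pairs):
    """Rows h[k][l], g[m][l] of Haar filters shifted by two samples per row."""
    r = 1.0 / math.sqrt(2.0)
    h = [[0.0] * (2 * n_pairs) for _ in range(n_pairs)]
    g = [[0.0] * (2 * n_pairs) for _ in range(n_pairs)]
    for k in range(n_pairs):
        h[k][2 * k], h[k][2 * k + 1] = r, r
        g[k][2 * k], g[k][2 * k + 1] = r, -r
    return h, g


def lift(h_old, ht_old, g_old, gt_old, s):
    """Apply (3); s[k][m] plays the role of the free parameter s~_{k,m}."""
    n, length = len(h_old), len(h_old[0])
    h = [[h_old[k][l] + sum(s[k][m] * g_old[m][l] for m in range(n))
          for l in range(length)] for k in range(n)]
    gt = [[gt_old[m][l] - sum(s[k][m] * ht_old[k][l] for k in range(n))
           for l in range(length)] for m in range(n)]
    # The low pass decomposition and high pass reconstruction filters are kept.
    return h, [row[:] for row in ht_old], [row[:] for row in g_old], gt


def inner(a_row, b_row):
    return sum(x * y for x, y in zip(a_row, b_row))


def is_biorthogonal(h, ht, g, gt, tol=1e-12):
    """Check the four biorthogonal conditions for all index pairs."""
    n = len(h)
    for i in range(n):
        for j in range(n):
            delta = 1.0 if i == j else 0.0
            if abs(inner(h[i], ht[j]) - delta) > tol:
                return False
            if abs(inner(gt[i], h[j])) > tol:
                return False
            if abs(inner(g[i], ht[j])) > tol:
                return False
            if abs(inner(g[i], gt[j]) - delta) > tol:
                return False
    return True


h_old, g_old = haar_rows(3)
s = [[0.3, -0.1, 0.0], [0.2, 0.5, -0.4], [0.0, 0.1, 0.25]]  # arbitrary values
h_new, ht_new, g_new, gt_new = lift(h_old, h_old, g_old, g_old, s)
```

The check passes for any `s` because the extra terms introduced in $h_{k,l}$ and
$\tilde{g}_{m,l}$ cancel pairwise in the inner products, which is what leaves the
parameters free to be learnt from data.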

4 Detection Algorithm 

4.1 Learning Method 

In this section, we denote an input signal again by $c^1_l$ and compute the new
high frequency components of $c^1_l$ by applying the new high pass decomposition
filter in (3). They can be written as

$$
d_m = \sum_l \tilde{g}_{m,l}\, c^1_l
    = \sum_l \tilde{g}^{old}_{m,l}\, c^1_l - \sum_k \tilde{s}_{k,m} \sum_l \tilde{h}^{old}_{k,l}\, c^1_l
    = r_m - \sum_k a_k\, \tilde{s}_{k,m},
\tag{4}
$$

where we put

$$ r_m = \sum_l \tilde{g}^{old}_{m,l}\, c^1_l, \qquad a_k = \sum_l \tilde{h}^{old}_{k,l}\, c^1_l. $$

Here, $r_m$ and $a_k$ indicate the high and the low frequency components obtained
by applying the old filters $\tilde{g}^{old}_{m,l}$ and $\tilde{h}^{old}_{k,l}$
to $c^1_l$. The high frequency components $r_m$ represent the details of the
signal $c^1_l$, whose probability distribution is concentrated around zero and
behaves like a generalized Gaussian distribution. This means that the
information of the input signal $c^1_l$ is concentrated in the high frequency
components of large magnitude. For example, Figure 2 shows the distribution of
$r_m$ for the signal in Figure 1.

Although such components certainly express the features of the input signal,
it is not easy to extract the features with fixed filters. Another problem is
that the high pass decomposition filters $\tilde{g}_{m,l}$ are not
shift-invariant.

To get the essential features of the signal, we determine the free parameters
$\tilde{s}_{k,m}$ so as to make the high frequency component $d_m$ in (4) vanish.
Since (4) includes several free parameters $\tilde{s}_{k,m}$ for each $m$, we can
extract various features of the input signal by putting

$$ d_m = r_m - \sum_k a_k\, \tilde{s}_{k,m} = 0. $$

Fig. 2. Distribution of $r_m$.



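To cope with noisy data, we learn the free parameters $\tilde{s}_{k,m}$ from
several training examples as follows. We prepare $2n$ training signals
$c^{1,\nu}_l$, $\nu = 1, 2, \ldots, 2n$, that include target events such as SC.
Then, we impose on them the following conditions:

$$ \sum_{k=m-n}^{m+n} a^{\nu}_k\, \tilde{s}_{k,m} - r^{\nu}_m = 0, \qquad \nu = 1, 2, \ldots, 2n, \tag{5} $$

where we put

$$ a^{\nu}_k = \sum_l \tilde{h}^{old}_{k,l}\, c^{1,\nu}_l, \qquad r^{\nu}_m = \sum_l \tilde{g}^{old}_{m,l}\, c^{1,\nu}_l. $$

At present, we have $2n+1$ unknown variables $\tilde{s}_{k,m}$, but the number of
equations in (5) is $2n$. One more equation for determining $\tilde{s}_{k,m}$
uniquely is the condition that the summation of the high pass filters
$\tilde{g}_{m,l}$ is zero, that is,

$$ \sum_l \tilde{g}_{m,l} = \sum_l \tilde{g}^{old}_{m,l} - \sum_{k=m-n}^{m+n} \tilde{s}_{k,m} \sum_l \tilde{h}^{old}_{k,l} = 0. $$

Since $\tilde{g}^{old}_{m,l}$ satisfy $\sum_l \tilde{g}^{old}_{m,l} = 0$, and
$\sum_l \tilde{h}^{old}_{k,l}$ takes the same nonzero value for every $k$, this
condition is equivalent to

$$ \sum_{k=m-n}^{m+n} \tilde{s}_{k,m} = 0. \tag{6} $$

Writing (5) and (6) in matrix form, we have

$$
\begin{pmatrix}
a^{1}_{m-n} & a^{1}_{m-n+1} & \cdots & a^{1}_{m+n} \\
a^{2}_{m-n} & a^{2}_{m-n+1} & \cdots & a^{2}_{m+n} \\
\vdots & \vdots & & \vdots \\
a^{2n}_{m-n} & a^{2n}_{m-n+1} & \cdots & a^{2n}_{m+n} \\
1 & 1 & \cdots & 1
\end{pmatrix}
\begin{pmatrix}
\tilde{s}_{m-n,m} \\ \tilde{s}_{m-n+1,m} \\ \vdots \\ \tilde{s}_{m+n-1,m} \\ \tilde{s}_{m+n,m}
\end{pmatrix}
=
\begin{pmatrix}
r^{1}_m \\ r^{2}_m \\ \vdots \\ r^{2n}_m \\ 0
\end{pmatrix}.
\tag{7}
$$

We can solve (7) by Gaussian elimination. Substituting the solutions
into (3), a learnt high pass filter $\tilde{g}_{m,l}$ can be obtained.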
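
The learning step amounts to assembling the $(2n+1) \times (2n+1)$ system (7)
and solving it. The following sketch shows this for toy numbers with $n = 1$. It
is an illustration under assumptions, not the authors' implementation: the
function names `gaussian_solve` and `learn_parameters` are hypothetical, the
values of $a^{\nu}_k$ and $r^{\nu}_m$ are invented, and exact fractions are used
instead of floating point so that the result can be checked by hand.

```python
from fractions import Fraction


def gaussian_solve(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with exact fractions."""
    n = len(matrix)
    aug = [[Fraction(v) for v in row] + [Fraction(b)]
           for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("singular system: training signals are degenerate")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


def learn_parameters(a, r):
    """a[v][j] holds a^v_{m-n+j}, r[v] holds r^v_m; returns the 2n+1 parameters."""
    width = len(a) + 1
    if any(len(row) != width for row in a) or len(r) != len(a):
        raise ValueError("expected 2n rows of length 2n+1 and 2n right-hand sides")
    # The last row of ones and the trailing zero encode condition (6).
    return gaussian_solve(a + [[1] * width], r + [0])


# Toy data for n = 1: two training signals, three unknown parameters.
a_train = [[1, 2, 1], [0, 1, 3]]
r_train = [2, 1]
s_learnt = learn_parameters(a_train, r_train)

# New high frequency component (4) of each training signal at time m.
residuals = [r_v - sum(a_vk * s_k for a_vk, s_k in zip(a_v, s_learnt))
             for a_v, r_v in zip(a_train, r_train)]
```

For these toy numbers the solution is $(-5/3,\ 2,\ -1/3)$, the parameters sum to
zero as required by (6), and the new high frequency component of every training
signal vanishes at time $m$.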

4.2 Detection Theory 

In the previous subsection, the new wavelet filters $\tilde{g}_{m,l}$ were
constructed by learning the free parameters $\tilde{s}_{k,m}$ from several
training signals. Now, we use the learnt filters $\tilde{g}_{m,l}$ to detect the
occurrences of target events in test signals.

Let $c^1_l$ be a test signal containing the target event. First, we compute from
$c^1_l$ the old and the new high frequency components by the old and new high
pass wavelet filters $\tilde{g}^{old}_{m,l}$ and $\tilde{g}_{m,l}$, respectively.
Recall that the parameters $\tilde{s}_{k,m}$ of the new wavelet filters
$\tilde{g}_{m,l}$ are optimized so that the new high frequency component
vanishes at the time when the target event occurs.

A possible strategy is to find the time $m$ at which the new high frequency
component is zero. Unfortunately, this strategy may also detect times $m$ other
than that of the SC, namely times at which both the old and the new high
frequency components are almost zero.

To avoid this kind of false detection, we search $c^1_l$ to find the time $m$
that maximizes the quantity

= ( 8 ) 

When the value $I_m > 0$ is larger than a certain threshold, we regard the time
$m$ as that of an SC. In the case of geomagnetic SC detection, we may expect to
distinguish the SC from other geomagnetic disturbances by the magnitude of (8).

We summarize the detection algorithm:

1. Prepare several subsignals containing the target event, i.e. SC, from given
signals of geomagnetic H-components for training $\tilde{s}_{k,m}$.

2. Learn $\tilde{s}_{k,m}$ by (7) for the subsignals chosen in Step 1.

3. Compute the old high frequency components for a test signal by making use
of the old high pass filter.

4. Apply the high pass filter $\tilde{g}_{m,l}$ learnt in Step 2 to the test
signal to get the new high frequency components.

5. Detect the target event (SC) as the time $m$ that maximizes $I_m$ in (8).
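
Steps 3 to 5 can be sketched as follows. The right-hand side of the criterion
(8) is not legible in this copy of the text, so the score used below, the
magnitude of the old high frequency component minus the magnitude of the new
one, is an assumption made only for illustration. It is large where the old
filter responds strongly and the learnt filter responds weakly, which matches
the behaviour described above, but it should not be read as the authors'
definition of $I_m$. The function name `detect_event` and the sample values are
also hypothetical.

```python
def detect_event(d_old, d_new, threshold):
    """Return (index, scores); index is None when no score exceeds threshold.

    ASSUMPTION: the score |d_old[m]| - |d_new[m]| stands in for the criterion
    I_m of (8), whose exact form is not recoverable from the text.
    """
    scores = [abs(o) - abs(n) for o, n in zip(d_old, d_new)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return (best if scores[best] > threshold else None), scores


# Invented components: the event at m = 1 is suppressed by the learnt filter,
# the disturbance at m = 3 is not.
d_old = [0.1, 3.0, 0.2, 2.5]
d_new = [0.1, 0.05, 0.2, 2.4]
found, scores = detect_event(d_old, d_new, threshold=1.0)
missed, _ = detect_event(d_old, d_new, threshold=5.0)
```

With these numbers the event at $m = 1$ is reported, while the large disturbance
at $m = 3$ is ignored because the learnt filter does not suppress it.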

5 Automatic Detection of Geomagnetic SC 

5.1 A Set of Old Wavelet Filters 

There are several kinds of biorthogonal wavelet filters. In this paper, we use
the convolution-type wavelet filters as the old wavelet filters. This type of
wavelet is a kind of biorthogonal wavelet which was developed recently in [4].
The decomposition and reconstruction filters of convolution-type
$\{h^{old}_{k,l}, \tilde{h}^{old}_{k,l}, g^{old}_{m,l}, \tilde{g}^{old}_{m,l}\}$
take the following form:

$$ h^{old}_{k,l} = \alpha_{l-2k}, \qquad g^{old}_{m,l} = \beta_{l-2m}, $$

$$ \tilde{h}^{old}_{k,l} = (p * \alpha)_{l-2k}, \qquad \tilde{g}^{old}_{m,l} = (p * \beta)_{l-2m}. $$

The selection of indexes $l - 2k$ and $l - 2m$ means down-sampling by two. The
parameter $p$ is a three-dimensional vector $p = (p_{-1}, p_0, p_1)$ and $*$
denotes convolution. If we choose $p_{-1} = p_1 = -0.5$ and $p_0 = 2$, the
reconstruction filters $\alpha_l$ and $\beta_l$ are given in Table 1. The
decomposition filters $(p * \alpha)_l$ and $(p * \beta)_l$ are easily constructed
from $\alpha_l$ and $\beta_l$ and are symmetric as shown in Figure 3.

Table 1. The coefficients $\alpha_l$ and $\beta_l$.

  l    $\alpha_l$        $\beta_l$
 -7                     -0.000023342437
 -6                     -0.000118322123
 -5                     -0.000389002680
 -4                     -0.002317266249
 -3                     -0.007181850389
 -2    0.353553390593   -0.074595873421
 -1    0.707106781187   -0.196528378993
  0    0.353553390593    0.562325895710
  1                     -0.196528378993
  2                     -0.074595873421
  3                     -0.007181850389
  4                     -0.002317266249
  5                     -0.000389002680
  6                     -0.000118322123
  7                     -0.000023342437




Fig. 3. The initial wavelet filters $(p * \alpha)_l$ and $(p * \beta)_l$.
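
The construction of the decomposition filters from Table 1 can be reproduced
with a short script. The sketch below is an illustration only: it builds
$(p * \alpha)_l$ for $p = (-0.5, 2, -0.5)$ and checks the first biorthogonal
condition for the low pass pair. The index ranges are taken from Table 1 as laid
out in this text, and the helper names `convolve` and `shifted_inner` are
hypothetical.

```python
# Reconstruction low pass filter alpha_l from Table 1, indexed by l.
ALPHA = {-2: 0.353553390593, -1: 0.707106781187, 0: 0.353553390593}

# Reconstruction high pass filter beta_l from Table 1 (symmetric about l = 0).
BETA_HALF = [0.562325895710, -0.196528378993, -0.074595873421,
             -0.007181850389, -0.002317266249, -0.000389002680,
             -0.000118322123, -0.000023342437]
BETA = {l: BETA_HALF[abs(l)] for l in range(-7, 8)}

# The vector p = (p_{-1}, p_0, p_1) chosen in the text.
P = {-1: -0.5, 0: 2.0, 1: -0.5}


def convolve(p, a):
    """Return (p * a)_l = sum_j p_j * a_{l-j} as a dict indexed by l."""
    out = {}
    for j, pj in p.items():
        for i, ai in a.items():
            out[i + j] = out.get(i + j, 0.0) + pj * ai
    return out


def shifted_inner(a, b, k):
    """Return sum_l a_l * b_{l-2k}, i.e. the inner product of shifted rows."""
    return sum(v * b.get(l - 2 * k, 0.0) for l, v in a.items())


p_alpha = convolve(P, ALPHA)                       # decomposition low pass
low_pass_check = [shifted_inner(ALPHA, p_alpha, k) for k in (-1, 0, 1)]
beta_sum = sum(BETA.values())                      # should be close to zero
```

The inner products come out as 0, 1 and 0 up to rounding of the tabulated
values, and the tabulated $\beta_l$ sum to almost zero, as expected of a high
pass filter truncated to fifteen taps.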



5.2 Simulation 

We attempted to detect the geomagnetic SCs from the H-component data collected
at World Data Center C2 (WDC-C2) for Geomagnetism, Kyoto. Four sampling types,
that is, geomagnetic 1 second, 1 minute, hourly, and yearly value data, are
stored at WDC-C2, Kyoto.

In this simulation, we used the H-components of the geomagnetic 1 minute value
data. These data have been stored at WDC-C2, Kyoto since 1975, and the
size of all data is about 500 Mbytes for Kakioka station alone.

As training signals, we used four whole days of H-components, dated May 23,
1989, June 14, 1990, April 4, 1991 and February 21, 1994, at Kakioka station.
These training signals contain typical SCs as shown in Figure 4. We denote these





Fig. 4. Training patterns (recovered panel titles: June 14, 1990; April 4, 1991;
February 21, 1994).



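training signals by $c^{1,\nu}_l$, $\nu = 1, 2, 3, 4$. Therefore, the number of
free parameters $\tilde{s}_{k,m}$ that appeared in the previous section is five
and equation (7) is written as

$$
\begin{pmatrix}
a^{1}_{m-2} & a^{1}_{m-1} & a^{1}_{m} & a^{1}_{m+1} & a^{1}_{m+2} \\
a^{2}_{m-2} & a^{2}_{m-1} & a^{2}_{m} & a^{2}_{m+1} & a^{2}_{m+2} \\
a^{3}_{m-2} & a^{3}_{m-1} & a^{3}_{m} & a^{3}_{m+1} & a^{3}_{m+2} \\
a^{4}_{m-2} & a^{4}_{m-1} & a^{4}_{m} & a^{4}_{m+1} & a^{4}_{m+2} \\
1 & 1 & 1 & 1 & 1
\end{pmatrix}
\begin{pmatrix}
\tilde{s}_{m-2,m} \\ \tilde{s}_{m-1,m} \\ \tilde{s}_{m,m} \\ \tilde{s}_{m+1,m} \\ \tilde{s}_{m+2,m}
\end{pmatrix}
=
\begin{pmatrix}
r^{1}_m \\ r^{2}_m \\ r^{3}_m \\ r^{4}_m \\ 0
\end{pmatrix}.
$$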



Solving this system, we can obtain the free parameters and construct an adaptive
filter by (3). We illustrate the filter in Figure 5.

Fig. 5. Adaptive filter.

Applying the filter above to the 17 H-component signals observed at Kakioka in
1990 which contain SC phenomena, we obtained the results in Table 2. In Table 2,
the detection times are arranged in order of the magnitude $I_m$ and the times
of target events were underlined. Table 2 shows that our method is useful for
the automatic detection of SCs. Figure 6 shows the computed values $I_m$ for the


Table 2. Detection results.

Date            Time of SC   Detection Times
Feb 13, 1990    17:15        17:15, 23:35, 12:32, 01:28, 01:58
Feb 15, 1990    06:26        21:57, 21:51, 22:48, 22:35, 22:40
Mar 12, 1990    15:03        15:04, 23:21, 23:06, 15:03, 22:14
Mar 20, 1990    22:44        22:44, 22:43, 03:22, 12:59, 00:26
Mar 30, 1990    07:21        07:21, 07:20, 09:48, 09:30, 00:36
Apr 9, 1990     08:43        08:43, 10:30, 22:37, 16:27, 13:48
Apr 12, 1990    03:26        05:06, 05:23, 04:38, 03:45, 05:30
Apr 17, 1990    07:19        07:20, 00:42, 21:55, 19:33, 08:46
Apr 23, 1990    03:36        05:27, 05:25, 03:37, 05:07, 19:40
May 18, 1990    07:40        13:04, 07:40, 07:39, 13:36, 22:13
May 21, 1990    10:22        10:21, 20:45, 02:03, 22:15, 23:03
May 26, 1990    20:37        20:37, 20:38, 21:56, 23:25, 23:27
Jun 12, 1990    08:20        20:59, 18:05, 20:58, 22:58, 14:59
Jul 28, 1990    03:31        03:31, 05:58, 05:56, 07:20, 06:31
Aug 1, 1990     07:41        07:41, 07:42, 11:22, 14:11, 07:45
Aug 26, 1990    05:43        11:05, 05:43, 06:12, 14:07, 11:50
Nov 26, 1990    23:32        23:32, 10:22, 01:57, 03:01, 19:58


H-components in the left side of Figure 1. The peak in Figure 6 corresponds to
the SC. However, the peak of the high frequencies in Figure 1 appears at a
different time.

6 Conclusion 

Fig. 6. Computed values $I_m$.

We proposed an automatic detection method for geomagnetic sudden commencement
using lifting wavelet filters. In the simulation, we applied this detection
algorithm, learnt using four training signals, to some test signals observed at
Kakioka station in 1990 which contain SC phenomena. The experimental results
show that the method succeeded in automatically detecting SCs. However, high
frequency components with huge magnitude obtained by using the old filter
make $I_m$ large whether or not the learnt wavelet filters are applied.
Improving the criterion $I_m$ is left for future work.
Abstract. Within the empirical ILP setting we propose a method of
inducing definite programs from examples — even when those examples 
are incomplete and occasionally incorrect. This system, named NRMIS, 
is a top-down batch learner that can make use of intensional background 
knowledge and learn programs involving multiple target predicates. It 
consists of three components: a generalization of Shapiro’s contradiction 
backtracing algorithm; a heuristic guided search of refinement graphs; 
and a LIME-like theory evaluator. Although similar in spirit to MIS, 
NRMIS avoids its dependence on an oracle while retaining the expres- 
siveness of a hypothesis language that allows recursive clauses and func- 
tion symbols. NRMIS is tested on domains involving noisy and sparse 
data. The results illustrate NRMIS’s ability to induce accurate theories 
in all of these situations. 



1 Introduction 

An inductive logic programming (ILP) system can be loosely characterized as a
concept discovery tool that uses logic programs to describe its hypotheses. The
use of logic programs as a hypothesis language offers several advantages over 
other choices of hypothesis representation such as clusters or decision trees. 
These advantages include the ability to describe a very rich class of concepts 
and a convenient way of providing systems with background knowledge about 
a domain. Also, hypotheses output as logic programs are generally easier for 
humans to understand and analyze. 

There has been a recent trend in ILP to develop systems that can perform 
well in a variety of domains that are grounded in practical settings. Difficulties 
such as missing or inconsistent data are commonplace in these domains and have, 
to an extent, prevented the successful application of ILP techniques to real world 
problems. 

The issue of inconsistent or noisy data has been addressed by systems such
as Foil, mFoil, Progol and Lime, to name but a few. A
common characteristic of these systems, besides their robustness, is that their
performance on training sets with missing or sparse data is fairly poor. On the
other hand, systems such as Mis, Lopster, Crustacean, Skilit,
and Foil-i can learn from sparse data but do not handle noisy data at
all. Systems that are robust to noisy training data tend to be systems that use
extensional (example-based) cover when determining the fitness of a hypothesis.
Conversely, systems that use proof-based or intensional cover generally perform
well on sparse data.

There has also been a growing interest in multiple predicate learning within
ILP. A handful of systems (Mis, MPL, Nmpl, and Mult_icn) can
correctly induce programs with multiple target predicates but none can handle
any amount of noise. All but Mult_icn are intensional systems.

In this paper we propose an ILP system, Nrmis (pronounced “near-miss”), 
which modifies Mis so it can learn from noisy examples and without an oracle. 
Nrmis inherits from Mis, amongst other things, the ability to learn multiple 
target predicates simultaneously while improving on the efficiency of the search 
method of its predecessor. This can be seen as a step towards integrating noise 
handling techniques with the benefits of an intensional system. 

Central to the Nrmis algorithm presented in Section 3 are the refinement
graphs and markings described in Section 2. Nrmis is then tested on
several domains, covering the range of difficulties discussed earlier, and a
discussion of these results and future work concludes the paper.

We use the following description of an ILP problem to set the scene and 
introduce some notation. Any concepts not thoroughly defined here can be found 

in [T^. 

Let $L$ be a first-order logic derived from an alphabet which contains only
finitely many predicate and function symbols. An unknown model $M$ assigns a
truth value to each ground fact in $L$. Ground facts of $L$ are called examples
and those which are true (resp. false) in $M$ are called positive (resp.
negative) examples. Given background knowledge $B$ (a finite set of clauses true
in $M$), a set $E^+$ of positive examples, and a set $E^-$ of negative examples
we wish to find a hypothesis $\Sigma$ (a finite set of clauses) that entails
every positive example in $E^+$ and none of the negative examples in $E^-$. Such
a hypothesis is called correct.

We will say a set of examples is noisy with respect to a target hypothesis
if that hypothesis is not correct with respect to the examples. In this
situation the aim is to find a hypothesis that has the same extension as the
target hypothesis. The noise model we will use in this paper is outlined later
in the paper.

The system described in this paper addresses the ILP problem in the empirical
setting, which requires that: the entire example set be given in advance,
no queries are made of the underlying model $M$, and the initial hypothesis
contains no definitions for the predicates being learned. Also, we will only
consider the ILP problem for definite programs. This allows us to use
SLD-refutation, denoted by $\vdash$, as the proof procedure to determine the
intensional cover of a theory. In order to guarantee the termination of this
procedure we have implemented it with a depth-bound which, in our experiments,
is set large enough to avoid any problems.

2 Refinements and Markings

The problem of finding a correct hypothesis can be seen as a search of the space
of all definite clauses in $L$. This search space, which we will call $L_{def}$,
can be structured through the use of a refinement operator.

Given a definite clause $C$ and a refinement operator $\rho$ the set $\rho(C)$
contains all of the most general specializations of $C$ in $L_{def}$. The
specialization order placed on the clauses in $L_{def}$ is that of Plotkin’s
$\theta$-subsumption. The refinement operator and specialization order induce a
refinement graph over the clauses in $L_{def}$. This graph has a directed edge
from clause $C$ to clause $D$ if and only if $D \in \rho(C)$, i.e., when $D$ is
a refinement of $C$.




Fig. 1. A marking for $L_{def}$’s refinement graph



A marking $M$ is a structure that is used intensively by Nrmis when searching
for a correct hypothesis. It consists of three finite subsets of $L_{def}$: the
current hypothesis $M_{cur}$, a set of deleted clauses $M_{del}$, and a set of
clauses, $M_{pass}$, marked passed. The marking consisting of these sets will be
denoted $M$. A diagrammatic example of a refinement graph and a marking is given
in Figure 1. The alphabet for $L$ in this case consists of the predicate even/1,
the function s/1 and the constant 0. The refinement operator used here (named
$\rho_2$ in earlier work) is based on a context-free transformation.

When the language $L$ contains many predicates and function symbols the
sets $\rho(C)$ can be quite large, which causes the refinement graph for $L$ to
grow extremely quickly. In order to keep the search of this graph manageable
Nrmis makes use of user-provided mode and type information as well as the search
heuristics outlined in Section 3.

3 The Nrmis Algorithm 

As the name suggests Nrmis is a modification of Shapiro’s Mis. Both are
top-down intensional algorithms that form hypotheses by searching a space of
definite clauses using a refinement graph. Unlike Mis our system is not an
incremental learner, that is, all the training examples are given to our
algorithm in advance. Also, Nrmis does not require an oracle and can tolerate
noise in the training examples it is given. These advantages are due to a
generalization of Shapiro’s contradiction backtracing algorithm and a theory
evaluation heuristic similar to Lime’s, respectively. These are detailed in
Section 3.2 and Section 3.4.

An overview of Nrmis is given in Figure 2. Input to this algorithm consists
of background knowledge $B$ in the form of a definite program, a set of positive
examples $E^+$ and a set of negative examples $E^-$. The marking $M$ is
initialized and a specialization/generalization loop is entered. The following
subsections detail the function of the subroutines Decision, Generalize,
Specialize, and Compress.



Nrmis(B, E+, E−)
    Q := empty priority queue
    M_cur := { □ }
    M_del := { }
    M_pass := { }
    repeat
        E−_bad := { e− ∈ E− | B ∪ M_cur ⊢ e− }
        E+_bad := E+ − { e+ ∈ E+ | B ∪ M_cur ⊢ e+ }
        case Decision(E−_bad, E+_bad) of
            specialize: Specialize(M, E−_bad)
            generalize: Generalize(M, E+_bad)
        insert(Compress(M_cur), Q)
    until E+_bad ∪ E−_bad = ∅
    output head of Q

Fig. 2. The NRMIS Algorithm

NRBackTrace(e−, b)
    R := a proof of e−
    repeat
        (G, A, C) := last(R)
        case Query truth value of A of
            true:    σ := 0,   τ := 1
            false:   σ := 1,   τ := 0
            unknown: σ := 1/2, τ := 1/2
        blame(C, b, σ)
        b := τ b
        R := init(R)
    until (R is empty) ∨ (b = 0)

Fig. 3. NRBackTrace



Ideally, the Nrmis algorithm terminates when a correct hypothesis is found. 
This is not always possible when noise is present in the training examples. To 
overcome this the present implementation of Nrmis places an upper bound on 
the number of times the body of the main loop can be executed. If this upper 
bound is reached the best hypothesis in the priority queue Q is taken to be the 
system’s final hypothesis. A hypothesis’ position in Q is determined by a ranking 
given to it by Compress. 



3.1 Decision 

In Shapiro’s Mis a generalization and a specialization step are both performed
in every iteration of its main loop. When noise is present in the data this
approach is too coarse and frequently overlooks good candidate hypotheses. By
separating these two steps Nrmis implements a finer search of the refinement
graph.

Each repetition of the main loop in Figure 2 calculates $E^-_{bad}$, the set of
negative examples that are implied by $M_{cur}$, and $E^+_{bad}$, the set of
positive examples not implied by $M_{cur}$. The routine Decision guides the
search for a correct hypothesis at a high level. It is a heuristic that simply
aims to minimize the proportion of bad positive or bad negative examples. If the
proportion $|E^-_{bad}| / |E^-|$ is larger than $|E^+_{bad}| / |E^+|$ a
specialization is performed in this iteration, otherwise a generalization takes
place.¹
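
The rule used by Decision is small enough to state as code. The sketch below is
an illustration rather than the authors' implementation: the function name
`decision`, the representation of examples as Python sets of strings, and the
treatment of an empty example set as a proportion of zero are all assumptions.

```python
def decision(bad_negatives, negatives, bad_positives, positives):
    """Choose the next step by comparing the two proportions of bad examples."""
    # An empty example set contributes a proportion of 0 (an assumption).
    neg_rate = len(bad_negatives) / len(negatives) if negatives else 0.0
    pos_rate = len(bad_positives) / len(positives) if positives else 0.0
    return "specialize" if neg_rate > pos_rate else "generalize"


# Half of the negatives are covered, no positive is missed: specialize.
first = decision({"n1"}, {"n1", "n2"}, set(), {"p1"})
# No negative is covered, half of the positives are missed: generalize.
second = decision(set(), {"n1"}, {"p1"}, {"p1", "p2"})
# Equal proportions fall through to generalization, as in the text.
third = decision({"n1"}, {"n1", "n2"}, {"p1"}, {"p1", "p2"})
```

Because only one of the two steps is taken per iteration, the hypothesis changes
by a smaller amount each time than in Mis, which is the finer search referred to
above.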

3.2 Specialize 

The procedure Specialize is used to reduce the number of negative examples
covered by $M_{cur}$. It modifies the marking $M$ by moving clauses from
$M_{cur}$ to $M_{del}$. The clauses to be moved depend on what examples appear
in $E^-_{bad}$.

Specialize relies heavily on a modified version of Shapiro’s contradiction
backtracing algorithm. We will refer to the original as BackTrace and the
modification as NRBackTrace. The main difference between them is that the former
requires access to an oracle - in the form of an enumeration of the underlying
model or a user answering queries - whereas NRBackTrace does not.

If a negative example $e^-$ can be covered by $M_{cur}$ then there must be an
SLD-refutation of $e^-$ using clauses from the background knowledge and
$M_{cur}$. We will represent an SLD-refutation of $e^-$ as a sequence of
resolution steps, $R_i = (G_i, A_i, C_i)$ for $i = 1 \ldots n$. Each $G_i$ is a
goal (a definite clause with no head literal), $A_i$ is an atom of $G_i$ and
$C_i$ is a clause from $M_{cur} \cup B$ such that its head unifies with $A_i$.
For $i = 1 \ldots n-1$, $G_{i+1}$ is the resolvent of $G_i$ and $C_i$ on $A_i$.
$G_1$ is the goal $\leftarrow e^-$ and $G_n$ resolves with $C_n$ to form the
empty clause □.

If $R = R_1 \ldots R_n$ is a proof then we denote by last($R$) the final
resolution step $R_n$ and the initial part of the proof, $R_1 \ldots R_{n-1}$,
by init($R$). We illustrate this somewhat cumbersome definition with an example.

Example 1. Consider the following definite program $\Sigma$ for the ternary
predicates representing addition, add/3, and multiplication, mult/3, in terms of
the successor function s/1:

    μ1: mult(A,B,B).
    μ2: mult(A,s(B),C) ← mult(A,B,Z), add(A,Z,C).
    μ3: mult(s(A),B,C) ← mult(A,B,Z), add(B,Z,C).
    α1: add(A,0,A).
    α2: add(A,s(B),s(C)) ← add(A,B,C).

This program is overgeneral due to the clause μ1, hence we have an
SLD-refutation of mult(1,2,3) as seen in Figure 4. Each resolution step in the
sequence is shown as the goal $G_i$ together with the clause $C_i$ used, and
each $A_i$ is underlined in $G_i$.

    ← _mult(1,2,3)_                   μ3
    ← mult(0,2,1), _add(2,1,3)_       α2
    ← mult(0,2,1), _add(2,0,2)_       α1
    ← _mult(0,2,1)_                   μ2
    ← mult(0,1,1), _add(0,1,1)_       α2
    ← mult(0,1,1), _add(0,0,0)_       α1
    ← _mult(0,1,1)_                   μ1
    □

Fig. 4. An SLD-resolution of mult(1,2,3) from $\Sigma$ (see Example 1)

¹ $|\cdot|$ is used to denote the cardinality of a set.

We now detail our version of the backtracing algorithm, NRBackTrace, as
shown in Figure 3. It is used to allocate a given amount of “blame”, $b$, amongst



the clauses used in the proof of a negative example $e^-$. A proof,
$R = R_1 \ldots R_n$, of $e^-$ is found and the repeat loop is entered. Here the
last resolution step, $(G, A, C)$, is examined. If $A$ can be shown to be false
in the underlying model (via background knowledge or examples) then $C$ must be
false in the model as its head is false while its body is true (the rest of the
steps in the proof show that each of $C$’s body literals is true). $C$ is
therefore given all of the blame currently available by a call to
blame($C$, $b$, $\sigma$), which adds $\sigma \times b$ units of blame to the
clause $C$. Thus, in the case when $A$ is false $\sigma$ is set to 1.

If $A$ is true the blame is passed back to the resolution step that led to $G$
by setting $\tau$ to 1 and the process continues. Finally, if there is
conflicting evidence for $A$’s truth value or it is simply not known, half the
blame is given to $C$ and the rest is propagated back through the proof. An
example of this process can be found in Example 2.

Note that if the truth values of every atom in the proof are known then
NRBackTrace will allocate all the blame to exactly one clause. In this sense
NRBackTrace can be seen to generalize Shapiro’s BackTrace algorithm to an
oracle-free setting.

Example 2. Suppose we have the following sets of (noisy) positive and negative
examples,

    E+ = { mult(0,2,1), add(0,0,0), add(2,0,2) },
    E− = { mult(0,2,1), mult(1,2,3) },

and our hypothesis is the theory $\Sigma$ given in Example 1. The result of
calling the procedure NRBackTrace with inputs $b = 1$ and $e^- = $ mult(1,2,3)
is given in the table below.



    Clause   μ1     μ2     μ3      α1    α2
    Blame    1/2    1/8    1/16    0     5/16
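
The blame values in the table can be reproduced by running the loop of Figure 3
over the proof of Figure 4. The following sketch is an illustration under
assumptions rather than the authors' code: clauses are identified by plain
strings, the proof is given as a list of (atom, clause) pairs in the order of
Figure 4, and the truth value of an atom is looked up in the example sets only
(the text also allows background knowledge to be consulted). The names
`truth_value` and `nr_backtrace` are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction


def truth_value(atom, positives, negatives):
    """Return True, False, or None when the evidence is missing or conflicting."""
    in_pos, in_neg = atom in positives, atom in negatives
    if in_pos and not in_neg:
        return True
    if in_neg and not in_pos:
        return False
    return None


def nr_backtrace(proof, b, positives, negatives, blame=None):
    """Distribute b units of blame over the clauses used in proof."""
    blame = defaultdict(Fraction) if blame is None else blame
    steps = list(proof)
    b = Fraction(b)
    while steps and b != 0:
        atom, clause = steps.pop()            # last(R), then R := init(R)
        value = truth_value(atom, positives, negatives)
        if value is True:
            sigma, tau = Fraction(0), Fraction(1)
        elif value is False:
            sigma, tau = Fraction(1), Fraction(0)
        else:
            sigma = tau = Fraction(1, 2)
        blame[clause] += sigma * b            # blame(C, b, sigma)
        b = tau * b
    return blame


E_POS = {"mult(0,2,1)", "add(0,0,0)", "add(2,0,2)"}
E_NEG = {"mult(0,2,1)", "mult(1,2,3)"}

# Resolution steps (A_i, C_i) of the refutation of mult(1,2,3) in Figure 4.
PROOF = [
    ("mult(1,2,3)", "mu3"),
    ("add(2,1,3)", "alpha2"),
    ("add(2,0,2)", "alpha1"),
    ("mult(0,2,1)", "mu2"),
    ("add(0,1,1)", "alpha2"),
    ("add(0,0,0)", "alpha1"),
    ("mult(0,1,1)", "mu1"),
]

result = nr_backtrace(PROOF, 1, E_POS, E_NEG)
```

Passing the same `blame` dictionary to successive calls accumulates blame over
several negative examples, which is how Specialize uses the procedure. Note that
mult(0,2,1) appears in both example sets, so its truth value counts as unknown
and clause μ2 receives only half of the blame that reaches it.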



A description of the procedure Specialize can now be given. For each
$e^- \in E^-_{bad}$ a call to NRBackTrace is made and blame is accumulated
between calls (i.e., the blame is summed over every proof of a negative example
a clause is involved in). At the end of this the most “guilty” clauses are
removed from $M_{cur}$ and placed in $M_{del}$. The most guilty clauses will be
in some way responsible for



258 



Eric McCreath and Mark Reid 



the hypothesis covering negative examples, therefore their removal from
$M_{cur}$ will result in a specialization.

A brief discussion of the time complexity of these algorithms is instructive
at this point. The most time consuming part of the Specialize procedure is
determining if negative examples are covered by its current hypothesis. As
Nrmis uses intensional cover, if a negative example is not covered by a
hypothesis $H$, every possible depth-bounded SLD-refutation must be tried before
returning a negative result (this is a version of negation by finite failure).
When $H$ contains many interrelated and complicated clauses this search can be
very inefficient. Care is taken when generalizing hypotheses to minimize this
problem through the use of mode and type information as well as a bias for the
addition of simple clauses over more complicated ones.

Once a proof of a negative example is found the NRBackTrace procedure’s 
time complexity is roughly linear in the number of resolution steps in the proof. 
For each resolution step an additional expense is incurred when determining 
whether or not the atom resolved upon is true or false. This expense is dependent 
on the complexity of the background knowledge and can be reduced somewhat 
through standard dynamic programming techniques. 

3.3 Generalize 

The Generalize procedure used in the Nrmis algorithm (Figure 2) is much like
that found in Mis. When there are positive examples not covered by $M_{cur}$,
clauses need to be added to the hypothesis. Clauses from $M_{del}$ are moved to
$M_{pass}$ and the refinement operator $\rho$ is applied to them. New clauses
are chosen from these refinement sets and added to $M_{cur}$.

As the refinement graph can grow exponentially in the worst case it is
imperative to keep the overall search efficient. A heuristic is therefore used
to decide in which order the clauses in $M_{del}$ should be refined. Given a way
of measuring the “size” and “utility” of a clause, preference is given to
smaller and more useful clauses. We use the following definition for the size of
a clause: the size of a clause is equal to the number of symbols, including
punctuation, that appear in the clause minus the number of distinct variables. A
useful property of this size measure is that there are only finitely many
clauses of any given size.

Like the notion of guilt used by Specialize, the utility of a clause is based
on its involvement in the proof of examples. Each deleted clause $C$ has a set
covers$_H(C)$ of positive examples it helped cover when it was part of the
hypothesis $H$. The utility of $C$ is then defined to be the number of examples
the sets covers$_H(C)$ and $E^+_{bad}$ have in common. Thus, a clause is
considered useful if, in a past hypothesis $H$, it was used to cover several
positive examples that are not covered by the current hypothesis $M_{cur}$.

Once a clause $C \in M_{del}$ is chosen for refinement, all the clauses in $\rho(C) - M_{pass}$ 
are added to $M_{cur}$. Clauses in $M_{pass}$ are not considered for addition to the 
hypothesis as they have already been considered and deleted. Clause $C$ is moved 
from $M_{del}$ to $M_{pass}$ before the main loop of the Nrmis algorithm is executed 
again. 
In Mis smaller clauses are also refined before larger ones but there is no 
method of choosing one clause over another should a tie occur. Instead, Mis 
refines all the smallest clauses and adds the resulting clauses to the overspe- 
cific hypothesis. By only adding the refinements of a single clause each time 
Generalize is called our system implements a much finer and more directed 
search for new hypotheses. As these hypotheses tend to be smaller, deciding 
what examples a hypothesis covers is usually less complicated in Nrmis. 

3.4 Compress 

Finally, the procedure Compress plays an important role in making Nrmis noise 
resistant and reducing the redundancy of the programs it outputs. 

When a percentage of the examples given to a learning system are misclassified 
there is a tendency for the system to output hypotheses that explain these 
noisy examples. A common feature of these overzealous hypotheses is their large 
size. Striking a balance between accuracy and size is therefore a reasonable way 
of assessing the quality, $Q(H)$, of a hypothesis $H$. This philosophy is embodied 
in the Q-heuristic used in Lime im. Compress uses this heuristic to find an 
accurate and concise hypothesis $M_{best} \subseteq M_{cur}$. This is done using the following 
greedy strategy: An initial pruning of $M_{cur}$ takes place in which only clauses 
that are involved in proofs of positive examples are kept. This results in a set 
$H_0 = \{C_1, \ldots, C_n\} \subseteq M_{cur}$. For each $i = 1, \ldots, n$ we let $H_i = H_0 - \{C_i\}$. If 
$Q(H_0) > Q(H_i)$ for all $i$ then Compress returns $H_0$ as the best hypothesis. 
Otherwise, the $H_i$ with the highest quality is pruned and made the new $H_0$ and the 
procedure repeats. As this is a greedy search only $O(n^2)$ subsets of the original 
$n$-clause $H_0$ are considered before the procedure terminates. 
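
The greedy strategy can be sketched as follows. This is a minimal illustration, not the Nrmis code: the function name `compress`, the representation of a hypothesis as a frozenset of clauses, and the caller-supplied `quality` function (standing in for the Q-heuristic, whose definition is not given here) are all assumptions.

```python
from typing import Callable, FrozenSet, Hashable, Tuple

Hypothesis = FrozenSet[Hashable]


def compress(h0: Hypothesis,
             quality: Callable[[Hypothesis], float]) -> Tuple[Hypothesis, float]:
    """Greedily drop one clause at a time while that does not lower quality."""
    current, q_current = h0, quality(h0)
    while current:
        # All hypotheses H_i obtained by deleting exactly one clause C_i.
        candidates = [current - {c} for c in current]
        best = max(candidates, key=quality)
        q_best = quality(best)
        if q_current > q_best:
            break  # H_0 is strictly better than every H_i: stop here.
        current, q_current = best, q_best
    return current, q_current
```

Each pass evaluates at most n candidates and there are at most n passes, which is where the quadratic bound on the number of subsets considered comes from.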

It is important to note that Compress does not actually modify the marking 
M in any way, rather, it outputs a compressed version of Mcur along with a 
number indicating its quality. It is this quality estimate which determines the 
hypothesis’ position in the priority queue used in Figure 0. 

4 Experimental Results 

The focus of our experimental results is to demonstrate Nrmis’s ability to cor- 
rectly identify relations from training sets that are either incomplete or noisy. 
For these experiments we used an implementation of Nrmis written in the func- 
tional language Haskell which was generally an order of magnitude slower than 
the other systems. 

4.1 Sparse Data 

In 1^ a series of experiments are conducted that compare the performance of 
Foil, Progol and Foil-i on sparse or incomplete training sets. Table 1 shows 
how Nrmis and Skilit [7] compare to these systems when tested on the same 
domains. 
The domains used were member/2, length/2, last/2 and nth/3. For each domain 
a complete initial portion of examples is generated. For example, in the 
member/2 domain all possible ground facts involving the constants a, b and c 
and lists of length at most three were used. A training set drawn from this initial 
portion is said to have density 80% if the number of positive examples in the 
training set makes up 80% of the positive examples of the initial portion and the 
number of negative examples in the training set makes up 80% of the negative 
examples in the initial portion. For each density fifty random training sets are 
generated and each system’s output on these training sets is checked for its 
correctness. The number of training sets on which the system correctly induced a 
hypothesis is given in Table 1. 
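
Drawing a training set of a given density can be sketched as below. The function name `sample_training_set` and the rounding rule for fractional counts are assumptions made for illustration; the original experiments may have rounded differently.

```python
import random


def sample_training_set(positives, negatives, density, rng=random):
    """Return random subsets holding `density` of each example class."""
    n_pos = round(density * len(positives))  # rounding rule is an assumption
    n_neg = round(density * len(negatives))
    return rng.sample(positives, n_pos), rng.sample(negatives, n_neg)
```

Repeating the call fifty times with fresh random state would give the fifty training sets used per density.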



Table 1. Comparing the performance of Nrmis, Foil-i, Foil, Skilit and Progol 
when learning from sparse data in various domains. The results for all the 
systems apart from Nrmis and Skilit are taken from 0 

Correct theories (out of 50); "-" means no result is reported. 

Domain  Density  Nrmis  Skilit  Foil-i  Foil  Progol 
member  80%      50     50      50      41    26 
member  50%      50     50      50      36    20 
member  30%      50     46      50      16    5 
member  20%      50     44      49      8     2 
member  10%      41     38      38      3     0 
member  7%       31     23      22      0     0 
length  80%      41     -       38      38    39 
length  50%      29     -       18      18    13 
length  30%      22     -       6       4     0 
length  20%      17     -       4       2     0 
length  10%      0      -       0       0     0 
last    80%      50     50      50      45    21 
last    50%      50     44      44      24    6 
last    30%      50     33      33      25    0 
last    20%      49     29      23      13    0 
last    10%      28     7       2       2     0 
nth     80%      50     -       50      0     43 
nth     50%      47     -       49      6     43 
nth     30%      28     -       46      5     19 
nth     20%      33     -       27      0     0 
nth     10%      13     -       1       0     0 


We were unable to obtain sensible results from Skilit on the length/2 and 
nth/3 domains hence these results have been omitted from the table. This is 
most probably due to problems with our configuration of Skilit rather than a 
shortcoming of the system itself. 

Unlike Foil-i and Skilit, Nrmis was not specifically developed to induce 
logic programs from sparse examples. Nevertheless, its performance on these 
training sets compares favourably against the other two systems. This is a testament 
to the generality of the top-down refinement graph search used in Nrmis. 



^ For more information on these experiments and some example training sets the reader 
is referred to 0 and http://www-itolab.ics.nitech.ac.jp/research/ilp/foili.html 
4.2 Noisy Data 

Nrmis’s noise handling ability is demonstrated when it is required to learn add/3, 
the addition relation. Its target concept can be found in Example^ 

Noise-free positive examples are generated by randomly choosing an instance 
of add/3 and checking to see if it is true with respect to the target concept. If so, 
it is taken as a positive example. If not, another random instance is generated and 
this process continues until a positive example is found. An analogous process is 
used to generate noise-free negative examples. A noise rate of $\nu \in [0, 1]$ means that, 
with probability $\nu$, an instance of add/3 will be drawn randomly and classified 
as positive (negative) without consulting the target concept. 
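
The generation process can be sketched as follows. The sketch assumes that instances of add/3 are triples of small non-negative integers drawn uniformly (the bound `max_n` and the helper names are hypothetical); the original experiments may have represented numbers and sampled instances differently.

```python
import random


def is_add(x, y, z):
    """Target concept of add/3 on integers (an assumed representation)."""
    return x + y == z


def random_instance(rng, max_n):
    return tuple(rng.randint(0, max_n) for _ in range(3))


def draw_example(rng, want_positive, noise_rate, max_n=6):
    """Draw one instance to be labelled positive (or negative)."""
    if rng.random() < noise_rate:
        # Noisy case: the label is assigned without consulting the target.
        return random_instance(rng, max_n)
    while True:
        # Noise-free case: resample until the target agrees with the label.
        inst = random_instance(rng, max_n)
        if is_add(*inst) == want_positive:
            return inst
```

With `noise_rate = 0` every returned instance agrees with its label; with `noise_rate = 1` the labels carry no information about the target.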




Fig. 5. Predictive Error vs Noise on the addition domain 



Figure 5 compares the predictive error vs. noise rate curves for Nrmis, 
LiME fiiiiizl . Progol^^, and Foil^^ 0. For each noise rate, each of the sys- 
tems are given a training set of 200 positive and 200 negative randomly gener- 
ated, noisy examples. The predictive error for each hypothesis is approximated by 
averaging the proportion of misclassified positive and negative examples. These 
examples are drawn from a noise-free set of all possible instantiations of add/3 
with entries no greater than six (28 positive and 315 negative examples). Twenty 
of these trials are performed at each noise rate and the predictive error shown 
in the graph is averaged over these trials. As can be seen Nrmis performed 
similarly to Lime which also uses a theory evaluator based on the Q-heuristic. 
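
The error estimate described above can be reproduced in a few lines. This is a sketch under the assumption that instances are integer triples with entries from 0 to 6; the function names are hypothetical, and a hypothesis is modelled simply as a Python predicate.

```python
from itertools import product


def noise_free_instances(max_n=6):
    """All instantiations of add/3 with entries no greater than max_n."""
    instances = list(product(range(max_n + 1), repeat=3))
    pos = [t for t in instances if t[0] + t[1] == t[2]]
    neg = [t for t in instances if t[0] + t[1] != t[2]]
    return pos, neg


def predictive_error(hypothesis, pos, neg):
    """Average of the misclassification rates on positives and negatives."""
    false_neg = sum(1 for t in pos if not hypothesis(*t)) / len(pos)
    false_pos = sum(1 for t in neg if hypothesis(*t)) / len(neg)
    return (false_neg + false_pos) / 2
```

Averaging the two rates rather than pooling the examples keeps the 315 negatives from dominating the 28 positives.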

5 Discussion and Future Work 

We have presented in this paper an ILP system called Nrmis which is both 
noise-resistant and based on an intensional notion of cover. ILP systems in the 

^ We are using CProgol4.4 with the intensional cover flag set and Foil6.4 
past have only ever met one of these two criteria and so have trouble with one or 
more of: multiple predicate learning, learning from sparse data, and learning from 
noisy data. The experiments outlined in this paper show Nrmis to be capable 
of performing well on the last two of these. In addition, we have been able to 
get Nrmis to correctly induce programs for the mutually recursive predicates 
male-ancestor/2 and female-ancestor/2 (as described in as well as odd/1 
and even/1. 

The success of Nrmis is due to judicious use of the Q-heuristic and a modifi- 
cation of Shapiro’s contradiction backtracing algorithm which removes the need 
for an oracle. Combined with the elegance of Mis’s method of searching refine- 
ment graphs we have produced a system that uses a rich language to represent its 
hypotheses - recursive logic programs involving function symbols. This expres- 
siveness comes with one major drawback, however, and that is the inefficiency 
of the search which is particularly apparent when the language consists of a 
large number of predicates and function symbols. Some progress has been made 
towards taming this problem both here and in other systems. We have managed 
to obtain large improvements in speed over a simple Mis-like refinement graph 
search while keeping the search complete. To get further improvement it may be 
necessary to consider greedy, incomplete strategies. 

We have done some preliminary tests of Nrmis on larger domains such as 
document understanding 0 with some promising, though inconclusive, results. 
The biggest hurdle here is that the domain consists of over fifty predicates which 
means the refinement graph for this domain gets large very quickly. Further com- 
pounding the problem is the large number of examples that must be considered. 
We believe that, in principle, the Nrmis algorithm is capable of inducing an 
accurate hypothesis in this domain but our current implementation needs some 
optimizing before this will happen. This is our main focus for the near future. 

Other ongoing work includes looking at formalisms that can provide a theoretical 
basis for our modified backtrace algorithm, especially within a multiple 
predicate learning framework. 
Abstract. We present an efficient method for statistical parameter learning 
of a certain class of symbolic-statistical models (called PRISM programs) 
including hidden Markov models (HMMs). To learn the parameters, 
we adopt the EM algorithm, an iterative method for maximum likelihood 
estimation. For efficient parameter learning, we first introduce 
a specialized data structure for the explanations of each observation, and 
then apply a graph-based EM algorithm. The algorithm can be seen as 
a generalization of the Baum-Welch algorithm, an EM algorithm specialized 
for HMMs. We show that, given an appropriate data structure, the Baum-Welch 
algorithm can be simulated by our graph-based EM algorithm. 

1 Introduction 

To capture uncertain phenomena in a symbolic framework, we have been devel- 
oping a symbolic-statistical modeling language PRISM in the past years j1 d|1 1 ] 
A PRISM program is a probabilistic extension of a logic program based on 
distributional semantics, and the PRISM programming system has a built-in mechanism 
for statistical parameter learning from observed data. For parameter learning, 
we adopt the EM algorithm, an iterative method for maximum likelihood es- 
timation (MLE). With this learning ability built into the expressive power of 
first-order logic, as shown in m PRISM not only covers existing symbolic- 
statistical models ranging from hidden Markov models (HMMs) m to Bayesian 
networks (BNs) |Hj and to probabilistic context-free grammars (PCFGs) but 
can smoothly model the complicated interaction between gene-inheritance and 
a tribal social system discovered in the Kariera tribe HU. 

Our problem with the current learning algorithm for PRISM is that although 
our learning is completely general, it lacks the efficiency achieved by other spe- 
cialized EM algorithms such as the Baum- Welch algorithm f1 p8] for HMMs. We 
therefore propose an efficient learning framework, in which the learning can be 
done as efficiently as these specialized EM algorithms without hurting the 
integration of learning and computing. By closely inspecting our EM algorithm, 
we have found that it is possible to eliminate its computationally intractable part 
by imposing a couple of reasonable conditions on modeling, which results in 
a general but efficient learning algorithm running on a special type of graph 
structure. 

The purpose of this paper is to show mathematically how and why this improvement 
becomes possible. The rest of the paper is organized as follows. 
We first slightly modify the semantic framework of PRISM in such a way that 
the modification justifies the elimination of the computational redundancy which 
was inherent in our EM algorithm. Next we introduce a special class of directed 
acyclic graphs, each of which is a structural representation of the explanations for an 
observation. We call them support graphs. We then show that the graph-based 
EM algorithm, a new EM algorithm implemented on support graphs, attains the 
same time complexity as the Baum-Welch algorithm. 

2 Modifying PRISM to PRISM* 

2.1 Distributional Semantics 

We here briefly describe distributional semantics, the theoretical basis of PRISM 
(see 0 for details), and give some preliminary definitions. A program $DB$ we deal 
with is written as $DB = F \cup R$ where $F$ is a set of facts (unit clauses) and $R$ 
is a set of rules (clauses with a non-empty body). In a theoretical context, $DB$ 
is considered as a set of (possibly infinitely many) ground clauses. We define a 
joint distribution $P_F$ on the set of all possible interpretations for $F$, and think 
of each ground atom as a random variable taking 1 when true and 0 otherwise. 
We call $P_F$ a basic distribution. Then, there exists a way to extend $P_F$ to a 
joint distribution $P_{DB}$ on the set of all possible interpretations for ground atoms 
appearing in $DB$. The denotation of a logic program $DB$ is defined as $P_{DB}$. 

We write $head(R)$ for the set of heads appearing in $R$ and assume 
that $F \cap head(R) = \emptyset$ holds. In such a case, $DB$ is said to be separated. To 
keep matters simple, we further assume that there is a fixed set of ground 
atoms, a subset of $head(R)$, representing possible observations, and that an observation 
corresponds to randomly picking one of the atoms in this set as a goal (to be proved 
by our modeling program). We call the atoms in this set observable goals. Collecting 
$T$ observations means making $T$ statistically independent selections of observable 
goals $G_t$ ($1 \le t \le T$). 

For each observable goal $G$, assume there are finite sets $S^{(1)}, \ldots, S^{(m)}$ of ground 
atoms from $F$ such that $iff(R) \vdash G \leftrightarrow S^{(1)} \vee \cdots \vee S^{(m)}$, where $iff(R)$ is the 
completion 0 of $R$. Each of $S^{(1)}, \ldots, S^{(m)}$ is called a support set for $G$. A minimal 
support set (or an explanation) is a support set which is minimal w.r.t. the set 
inclusion ordering. For later use, we introduce $\psi_{DB}(G)$ as the set of minimal support 
sets of an observable goal $G$. 

2.2 PRISM Programs 

PRISM programs must satisfy the following conditions on the facts $F$ and the basic 
distribution $P_F$. 

1. A ground atom in $F$ is of the form msw(i,n,v). It is supposed to represent 
the fact that a multi-valued probabilistic switch named i yields a value v 
at the n-th sampling. v is taken from $V_i$, a finite set of ground terms specified 
beforehand. 

2. Put $V_i = \{v_1, v_2, \ldots, v_k\}$. Then exactly one of msw(i,n,$v_1$), msw(i,n,$v_2$), 
..., msw(i,n,$v_k$) always holds true. Put differently, $\sum_{v \in V_i} \theta_i(v) = 1$ always 
holds, where $\theta_i(v)$ is the probability of msw(i,n,v) ($v \in V_i$) being true. We 
call $\theta_i(v)$ a statistical parameter (or simply a parameter) of the program. 

3. For $n \neq n'$, msw(i,n,·) and msw(i,n',·) are independent and identically 
distributed (i.i.d.) random variables with common statistical parameter $\theta_i$. 
Also for $i \neq i'$, msw(i,·,·) and msw(i',·,·) are independent. 

We note that if $V_i = \{1, 0\}$, msw(i,n,v) coincides with a BS atom bs(i,n,v) 
in M- The second condition just says that the switch i takes a value v with 
probability $\theta_i(v)$. 



2.3 A Program Example 

For an example of a PRISM program, let us consider an HMM $M$ whose possible 
states are {s0, s1}, and whose possible output symbols are {a, b}. Our 
HMM follows the definition described in 0 (not in 0). The following is a program 
which represents $M$, where hmm(String) denotes the observable fact that 
String is a string sampled from $M$. The probabilistic behaviors of state transition 
and output are specified by the switches of the form msw(tr(·),·,·) and 
msw(out(·),·,·) respectively. The length of a string is fixed to three. 



    %------------------------- Declarations -------------------------%
    target(hmm,1).                 % Only hmm(_) is observable.
    data('hmm.dat').               % Data are contained in 'hmm.dat'.
    values(init,[s0,s1]).          % Switch 'init' takes 's0' or 's1'.
    values(tr(_),[s0,s1]).         % Switch 'tr(_)' takes 's0' or 's1'.
    values(out(_),[a,b]).          % Switch 'out(_)' takes 'a' or 'b'.

    %---------------------------- Model -----------------------------%
    strlen(3).                     % The length of a string is fixed to 3.

    hmm(Cs) :- msw(init,null,Si), hmm(1,Si,Cs).   % Start from state Si.

    hmm(T,S,[C|Cs]) :- strlen(L), T=<L,           % Loop:
        msw(out(S),T,C),                          %   Output C in state S.
        msw(tr(S),T,NextS),                       %   Transit from S to NextS.
        T1 is T+1,                                %   Put the clock ahead.
        hmm(T1,NextS,Cs).                         %   Repeat above (recursion).
    hmm(T,_,[]) :- strlen(L), T>L.                % Finish the loop.
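
To make the generative reading of this program concrete, the following sketch samples a string in the same order as the clauses above: first the switch init, then, at each step, out(S) followed by tr(S). It is an illustration in Python rather than PRISM; the dictionary layout of `theta` and the helper names are assumptions.

```python
import random


def draw(dist, rng):
    """Sample one value of a switch from its value -> probability table."""
    values = list(dist)
    return rng.choices(values, weights=[dist[v] for v in values])[0]


def sample_hmm(theta, rng, length=3):
    """Sample a string of fixed length, mirroring hmm/1 and hmm/3."""
    state = draw(theta["init"], rng)                  # msw(init,null,Si)
    output = []
    for _ in range(length):
        output.append(draw(theta[("out", state)], rng))  # msw(out(S),T,C)
        state = draw(theta[("tr", state)], rng)          # msw(tr(S),T,NextS)
    return output
```

Note that, as in the program, a transition switch is sampled even at the last step, although the resulting state is not used for output.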



2.4 Learning PRISM Programs 

Learning a PRISM program means MLE (maximum likelihood estimation) of the 
parameters in the program. That is, given observations $G_t$ ($1 \le t \le T$), we 
maximize the likelihood of these atoms, $\prod_{t=1}^{T} P_{DB}(G_t = 1 \mid \theta)$, by adjusting the parameters 
$\theta$ associated with the msws in the program. The learning is done by the following 
two-phase procedure: 

1. Search exhaustively for $S$ such that $S \in \psi_{DB}(G_t)$, for each observation $G_t$. 

2. Run the EM algorithm and get the estimate of the parameters $\theta$. 

In the second step, the EM algorithm makes calculations based on the statistics 
from $\psi_{DB}(G_t)$. This section describes the second step in detail. 

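In what follows, we consider a set $X$ of random variables as a random vector 
whose elements are the members of $X$. Note that the realization of $X$ is also a vector. Also $\mathbf{1}$ 
(resp. $\mathbf{0}$) is used to denote a vector consisting of all 1s (resp. 0s). We now make 
some definitions for an observable goal $G$. $relv(G)$ is the set of switches relevant to $G$, namely 
$\bigcup_{S \in \psi_{DB}(G)} S$. Then define 

\[ \Sigma_{DB}(G) \stackrel{\rm def}{=} relv(G) \cup \{\, \mathrm{msw}(i,n,v) \mid v \in V_i,\ \exists v'\, (\mathrm{msw}(i,n,v') \in relv(G),\ v' \neq v) \,\}, \]

and let $\Sigma_{DB}$ be the union of $\Sigma_{DB}(G)$ over all observable goals $G$. For $S \in \psi_{DB}(G)$, put 

\[ S^{-} \stackrel{\rm def}{=} \{\, \mathrm{msw}(i,n,v) \mid v \in V_i,\ \exists v'\, (\mathrm{msw}(i,n,v') \in S,\ v' \neq v) \,\}, \qquad S_{rest} \stackrel{\rm def}{=} \Sigma_{DB} - (S \cup S^{-}). \]

$S^{-}$ can be seen as the complement of $S$, since $S^{-} = \mathbf{0}$ if $S = \mathbf{1}$. $I$ and $\theta$ respectively 
stand for the set of all switch names and the set of all parameters appearing in 
$\Sigma_{DB}$, i.e. $I \stackrel{\rm def}{=} \{ i \mid \mathrm{msw}(i,\cdot,\cdot) \in \Sigma_{DB} \}$ and $\theta \stackrel{\rm def}{=} \{ \theta_i(v) \mid i \in I, v \in V_i \}$ in 
notation. Let $S$ be an arbitrary subset of $\Sigma_{DB}$. $S$ is said to be inconsistent if 
there are msws such that msw(i,n,v), msw(i,n,v') $\in S$ and $v \neq v'$. If there is 
no such pair, $S$ is consistent. Suppose $S, S' \subseteq \Sigma_{DB}$ are each consistent 
but $S \cup S'$ is inconsistent. In such a case, they are called disjoint. We assume 
the following disjointness condition: 

Disjointness condition: For any observable goal $G$ and any $S \in \psi_{DB}(G)$, $S$ is consistent, 
and disjoint from every $S'$ such that $S' \in \psi_{DB}(G)$, $S' \neq S$. 

Under the disjointness condition and the third condition on $P_F$ (i.e., if $i \neq i'$ or 
$n \neq n'$, where $i$, $i'$, $n$, and $n'$ are all ground terms, the random variables msw(i,n,·) 
and msw(i',n',·) are independent of each other), the likelihood of an observable goal $G$ is 
calculated as follows: 

\[ P_{DB}(G = 1 \mid \theta) = \sum_{S \in \psi_{DB}(G)} P_F(S = \mathbf{1} \mid \theta) \tag{1} \]
\[ \phantom{P_{DB}(G = 1 \mid \theta)} = \sum_{S \in \psi_{DB}(G)} \prod_{i \in I,\, v \in V_i} \theta_i(v)^{\sigma_{i,v}(S)} \tag{2} \]

where $\sigma_{i,v}(S)$ is the count of distinct msw(i,·,v)s in the support set $S$. Under 
the notation in j^, $\sigma_{i,v}(S)$ corresponds to $|S = \mathbf{1}|_{i,v}$. 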
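
Equations (1) and (2) amount to summing, over the explanations of a goal, the product of the parameters of the switch instances they contain. A minimal sketch, assuming each explanation is given as a collection of (i, n, v) triples and `theta[i][v]` holds $\theta_i(v)$ (both representations are hypothetical):

```python
from math import prod


def explanation_prob(explanation, theta):
    """P_F(S = 1 | theta): product of theta_i(v) over msw(i,n,v) in S."""
    return prod(theta[i][v] for (i, n, v) in explanation)


def likelihood(explanations, theta):
    """P_DB(G = 1 | theta) as the sum over the explanations of G."""
    return sum(explanation_prob(s, theta) for s in explanations)
```

The plain sum is only valid because the disjointness condition makes the explanations mutually exclusive events.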

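To realize MLE of a PRISM program, we first introduce the Q function by 

\[ Q(\theta_{new}, \theta) \stackrel{\rm def}{=} \sum_{t=1}^{T} \sum_{x} P_{DB}(\Sigma_{DB} = x \mid G_t = 1, \theta) \log P_{DB}(\Sigma_{DB} = x, G_t = 1 \mid \theta_{new}). \tag{3} \]

From the definition, it is straightforward to show that $Q(\theta_{new}, \theta) \ge Q(\theta, \theta)$ 
implies $\prod_{t=1}^{T} P_{DB}(G_t = 1 \mid \theta_{new}) \ge \prod_{t=1}^{T} P_{DB}(G_t = 1 \mid \theta)$. Our MLE procedure 
learn-PRISM shown below is an EM algorithm making use of this fact. It 
starts with initial values $\theta^{(0)}$ and iteratively updates, until saturation, $\theta^{(m)}$ 
to $\theta^{(m+1)}$ such that $Q(\theta^{(m+1)}, \theta^{(m)}) \ge Q(\theta^{(m)}, \theta^{(m)})$, to find the $\theta$ that maximizes 
$\prod_{t=1}^{T} P_{DB}(G_t = 1 \mid \theta)$. 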

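procedure learn-PRISM begin 
  foreach $i \in I$, $v \in V_i$ do                    /* Initialize the parameters: */ 
    Select some $\theta_i^{(0)}(v)$ such that $\sum_{v \in V_i} \theta_i^{(0)}(v) = 1$; 
  m := 0; 
  repeat 
    foreach $i \in I$, $v \in V_i$ do begin            /* E(xpectation)-step */ 
      $ON_i(v) := \sum_{t=1}^{T} \Big( \sum_{S \in \psi_{DB}(G_t)} P_F(S = \mathbf{1} \mid \theta^{(m)}) \big( \sigma_{i,v}(S) + \gamma_i(S)\,\theta_i^{(m)}(v) \big) \Big) \Big/ P_{DB}(G_t = 1 \mid \theta^{(m)})$ 
    end; 
    m := m + 1; 
    foreach i, v do $\theta_i^{(m)}(v) := ON_i(v) \big/ \sum_{v' \in V_i} ON_i(v')$;   /* M(aximization)-step */ 
  until $\lambda^{(m)} - \lambda^{(m-1)} < \varepsilon$, where $\lambda^{(m)} = \sum_{t=1}^{T} \log P_{DB}(G_t = 1 \mid \theta^{(m)})$;   /* Terminate if the log-likelihood saturates. */ 
end. 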

In learn-PRISM, $\varepsilon$ is a small positive constant. By the definition of $\Sigma_{DB}$ and 
$S_{rest}$, for $i \in I$, the count of distinct msw(i,·,v)s in $S_{rest}$ is the same for each 
$v \in V_i$, so it is written as $\gamma_i(S)$. The above algorithm is a general learning 
algorithm applicable to any PRISM program, but the existence of the term 
$\gamma_i(S)\theta_i(v)$ hinders efficient learning. We next replace $P_F$ by a more specialized 
distribution to eliminate this term. 

2.5 Constructing PRISM* Programs 

Suppose our program $DB = F \cup R$ with the basic distribution $P_F$ satisfies the 
following uniqueness condition. 

Uniqueness condition: For any observable goal $G$, $G$'s explanation does not 
explain other observable goals. That is, for observable goals $G$, $G'$ with $G \neq G'$, $S \notin \psi_{DB}(G')$ 
if $S \in \psi_{DB}(G)$. Also $\sum_{G} P_{DB}(G = 1 \mid \theta) = 1$ holds, where the sum ranges over all observable goals. 

Then, under the disjointness and the uniqueness condition, it is possible to 
construct from $DB$ a new program (let us call such programs PRISM* programs) $DB^* = 
F^* \cup R$ with a new fact set $F^*$ and a new basic distribution $P_{F^*}$ such that 

1. Every atom in $F^*$ takes the form msw(i,n,v) and the range of v is $V_i^* \stackrel{\rm def}{=} 
V_i \cup \{*\}$. Here $V_i$ stands for a set of ground atoms assigned to i and * is a 
new constant symbol appearing nowhere else. 

2. Put $V_i^* = \{v_1, \ldots, v_k, *\}$. Exactly one of msw(i,n,$v_1$), ..., msw(i,n,$v_k$), 
msw(i,n,*) becomes true. 

3. Define $S^* \stackrel{\rm def}{=} \{ \mathrm{msw}(i,n,*) \mid \mathrm{msw}(i,n,\cdot) \in S \}$ and $S^*_{rest} \stackrel{\rm def}{=} \{ \mathrm{msw}(i,n,*) \mid 
\mathrm{msw}(i,n,\cdot) \in S_{rest} \}$. For every observable goal $G$ and every $S \in \psi_{DB}(G)$, 

\[ P_{F^*}(S = x, S^{-} = x^{-}, S^{*} = x^{*}, S_{rest} = z, S^{*}_{rest} = z^{*} \mid \theta) \stackrel{\rm def}{=} \begin{cases} P_F(S = \mathbf{1} \mid \theta) & \text{if } x = z^{*} = \mathbf{1},\ x^{-} = x^{*} = z = \mathbf{0}, \\ 0 & \text{otherwise.} \end{cases} \tag{4} \]

From the disjointness and the uniqueness condition, it follows that $P_{F^*}$ is 
a probability distribution. We also note that, in the third condition, $P_{F^*}$ is 
defined in terms of the original PRISM program’s $P_F$ and $\psi_{DB}$. 



2.6 Learning PRISM* Programs 

The EM learning algorithm learn-PRISM* for PRISM* programs is much simpler 
thanks to the change of the basic distribution. Define $\Sigma^*_{DB} \stackrel{\rm def}{=} \Sigma_{DB} \cup 
\{ \mathrm{msw}(i,n,*) \mid \mathrm{msw}(i,n,\cdot) \in \Sigma_{DB} \}$ and a new Q function (called the Q* function): 

\[ Q^*(\theta_{new}, \theta) \stackrel{\rm def}{=} \sum_{t=1}^{T} \sum_{x} P_{DB^*}(\Sigma^*_{DB} = x \mid G_t = 1, \theta) \log P_{DB^*}(\Sigma^*_{DB} = x, G_t = 1 \mid \theta_{new}). \tag{5} \]

Similarly to learn-PRISM, we can derive learn-PRISM*, which updates $\theta^{(m)}$ to 
$\theta^{(m+1)}$ such that $Q^*(\theta^{(m+1)}, \theta^{(m)}) \ge Q^*(\theta^{(m)}, \theta^{(m)})$, and thus realizes MLE for 
a PRISM* program. By the definition of PRISM*, it is easy to show $P_{DB}(G = 1 \mid \theta) = 
P_{DB^*}(G = 1 \mid \theta)$ for any $\theta$ and any observable goal $G$, hence we can say learn-PRISM* also 
realizes MLE for the original PRISM program. Also, it is roughly proved in 
Appendix A.2 that, by the definition of PRISM*, learn-PRISM* is just learn-PRISM 
with the term $\gamma_i(S)\theta_i(v)$, which causes computational inefficiency, deleted. 
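
The simplified update can be sketched in a few lines. This is an illustrative reading of learn-PRISM without the $\gamma_i(S)\theta_i(v)$ term, not the authors' code: it assumes that every observation comes with an explicit list of explanations, each a list of (i, n, v) triples, and that `theta[i][v]` holds $\theta_i(v)$; the function names are hypothetical.

```python
from math import log, prod


def em_step(observations, theta):
    """One E-step and M-step; returns new parameters and the old log-likelihood."""
    counts = {i: {v: 0.0 for v in dist} for i, dist in theta.items()}
    loglik = 0.0
    for explanations in observations:
        probs = [prod(theta[i][v] for (i, n, v) in s) for s in explanations]
        p_t = sum(probs)                      # P_DB(G_t = 1 | theta)
        loglik += log(p_t)
        for s, p in zip(explanations, probs):
            for (i, n, v) in s:               # expected count of msw(i,.,v)
                counts[i][v] += p / p_t
    new_theta = {}
    for i, c in counts.items():
        total = sum(c.values())
        # Switches that never occur keep their old parameters.
        new_theta[i] = {v: (x / total if total > 0 else theta[i][v])
                        for v, x in c.items()}
    return new_theta, loglik


def learn(observations, theta, eps=1e-9, max_iter=1000):
    """Iterate until the log-likelihood improves by less than eps."""
    previous = None
    for _ in range(max_iter):
        theta, loglik = em_step(observations, theta)
        if previous is not None and loglik - previous < eps:
            break
        previous = loglik
    return theta
```

Enumerating explanations explicitly, as done here, can be exponentially expensive; avoiding that enumeration is the motivation for the support graphs introduced next.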



3 Graphical EM Algorithm 



This section introduces another learning procedure for PRISM programs. The 
new procedure makes use of a graphical structure for efficient learning. 

1. For each observation $G_t$, we first search exhaustively for $S$ such that 
$S \in \psi_{DB}(G_t)$. While searching, using search techniques such as tabulation, 
we construct a data structure, called a support graph, for each $G_t$. 

2. We then run a graph-based EM algorithm, called the graphical EM algorithm, 
on the constructed support graphs, and get the estimate of the parameters $\theta$. 

In the rest of this section, we assume that suitable support graphs are given, 
and consider the second step. It can then be proved that the learning algorithm 
for PRISM* programs (i.e., learn-PRISM*) and the graphical EM algorithm 
yield the same estimate of $\theta$ (see Appendix B). 
3.1 Support Graphs 

The support graph for $G_t$ ($1 \le t \le T$) is a triplet $(U_t, E_t, l_t)$, where $U_t$ is a set of 
nodes, $E_t \subseteq U_t \times U_t$ is a set of edges, and $l_t : U_t \to (\Sigma_{DB} \cup \{G_t, \Box\})$ is called a 
labeling function. A support graph $(U_t, E_t, l_t)$ for $G_t$ should satisfy the following 
conditions: 

- $(U_t, E_t)$ is a directed acyclic graph which has exactly one node (referred to 
as $u_t^{top}$) that is not the terminal node of any edge in $E_t$, and exactly one node 
(referred to as $u_t^{bot}$) that is not the initial node of any edge in $E_t$. We define 
$U_t^{body} \stackrel{\rm def}{=} U_t - \{u_t^{top}, u_t^{bot}\}$. 

- $u_t^{top}$ and $u_t^{bot}$ have special labels, i.e., $l_t(u_t^{top}) = G_t$ and $l_t(u_t^{bot}) = \Box$. 

- In any path from $u_t^{top}$ to $u_t^{bot}$, no switch occurs more than once, i.e., $u, u' \in 
nodes(r) \wedge u \neq u' \Rightarrow l_t(u) \neq l_t(u')$ for $r \in path(u_t^{top}, u_t^{bot})$, where $path(u, u')$ 
is the set of directed paths from $u$ to $u'$ and $nodes(r)$ is the set of nodes in the 
directed path $r$. 

- For $r, r' \in path(u_t^{top}, u_t^{bot})$ and $r \neq r'$, $labels(r)$ and $labels(r')$ are disjoint 
from each other, where $labels(r) \stackrel{\rm def}{=} \{ l_t(u) \mid u \in nodes(r) \} - \{G_t, \Box\}$ for the 
directed path $r$ in $(U_t, E_t, l_t)$. 

- $\psi_{DB}(G_t) = \{ \pi^{l_t} \mid \pi \in \Pi^{(U_t,E_t)}(u_t^{top}, u_t^{bot}) \}$, where 
$\Pi^{(U_t,E_t)}(u, u') \stackrel{\rm def}{=} \{ \pi \mid \exists r\, (r \in path(u, u'),\ \pi = nodes(r) - \{u, u'\}) \}$ and 
$\pi^{l_t} \stackrel{\rm def}{=} \{ l_t(u) \mid u \in \pi \}$ for $\pi \subseteq U_t^{body}$. 

We sometimes omit the superscript $(U_t, E_t)$ and abbreviate $\pi^{l_t}$ as $\pi^{l}$ if it does 
not cause confusion. 



3.2 Graphical EM Algorithm 

Once we have constructed the support graphs for all observations, the parameters 
are learned by the graphical EM algorithm. To specify this, we should add some 
definitions. $parent(u)$ and $child(u)$ refer to the set of parent nodes and that of child 
nodes of $u$, respectively. $p(u, \theta)$ is then defined for $u \in \bigcup_t U_t$ and $\theta$: 

\[ p(u, \theta) \stackrel{\rm def}{=} \begin{cases} \theta_i(v) & \text{if } u \in U_t^{body} \text{ and } l_t(u) = \mathrm{msw}(i,\cdot,v), \\ 1 & \text{otherwise.} \end{cases} \]

The graphical EM algorithm consists of a procedure learn-gEM and a function 
forward-backward. We prepare variables $\alpha(u)$ and $\beta(u)$ for each $u \in U_t$, $1 \le t \le 
T$, called the forward probability and backward probability of node $u$, respectively. 

procedure learn-gEM begin 
  foreach $i \in I$, $v \in V_i$ do                    /* Initialize the parameters: */ 
    Select some $\theta_i^{(0)}(v) \in (0, 1)$ such that $\sum_{v \in V_i} \theta_i^{(0)}(v) = 1$; 
  for t := 1 to T do $P_t$ := forward-backward($U_t$, $E_t$, $l_t$, $\theta^{(0)}$); 
  m := 0; 




A Graphical Method for Parameter Learning of Symbolic-Statistical Models 






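  repeat 
    foreach i, v do $ON_i(v)$ := 0; 
    for t := 1 to T do begin 
      foreach i, v do $on_i^t(v)$ := 0; 
      s := $U_t^{body}$; 
      while $s \neq \emptyset$ do begin                    /* E-step: */ 
        Choose some u from s; 
        if $l_t(u)$ = msw(i,·,v) then $on_i^t(v) := on_i^t(v) + \alpha(u)\beta(u)$; 
        s := s - {u}; 
      end; 
      foreach i, v do $ON_i(v) := ON_i(v) + on_i^t(v) / P_t$; 
    end; 
    m := m + 1; 
    foreach i, v do $\theta_i^{(m)}(v) := ON_i(v) \big/ \sum_{v' \in V_i} ON_i(v')$;   /* M-step */ 
    for t := 1 to T do $P_t$ := forward-backward($U_t$, $E_t$, $l_t$, $\theta^{(m)}$); 
  until $\lambda^{(m)} - \lambda^{(m-1)} < \varepsilon$, where $\lambda^{(m)} = \sum_{t=1}^{T} \log P_t$ after the m-th update;   /* Terminate if the log-likelihood saturates. */ 
end. 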

function forward-backward($U_t$, $E_t$, $l_t$, $\theta$) begin 
  foreach $u \in U_t$ do begin 
    $\alpha(u)$ := undef; $\beta(u)$ := undef; 
  end; 
  $\alpha(u_t^{top})$ := 1; $\beta(u_t^{bot})$ := 1; s := $child(u_t^{top})$; s' := $parent(u_t^{bot})$; 
  while $s \neq \emptyset$ do begin     /* Calculate forward probabilities for each node: */ 
    Choose some u from s such that $\forall u' \in parent(u)\, (\alpha(u') \neq$ undef$)$; 
    $\alpha(u) := \big( \sum_{u' \in parent(u)} \alpha(u') \big)\, p(u, \theta)$; 
    s := (s $\cup$ $child(u)$) - {u}; 
  end; 
  while $s' \neq \emptyset$ do begin    /* Calculate backward probabilities for each node: */ 
    Choose some u from s' such that $\forall u' \in child(u)\, (\beta(u') \neq$ undef$)$; 
    $\beta(u) := \sum_{u' \in child(u)} \beta(u')\, p(u', \theta)$; 
    s' := (s' $\cup$ $parent(u)$) - {u}; 
  end; 
  return $\alpha(u_t^{bot})$;           /* Return the likelihood. */ 
end. 
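
The forward and backward recursions can be sketched on an explicit DAG. This is a minimal illustration rather than the authors' implementation: nodes are arbitrary hashable names, `prob[u]` plays the role of $p(u, \theta)$ (1.0 for the top and bottom nodes), and the nodes are processed in a topological order computed with Kahn's algorithm instead of the "choose some u" loops; all of these choices are assumptions.

```python
def forward_backward(nodes, edges, prob):
    """Return (alpha, beta) for a DAG with one source and one sink."""
    parents = {u: [] for u in nodes}
    children = {u: [] for u in nodes}
    for u, v in edges:
        children[u].append(v)
        parents[v].append(u)

    # Topological order (Kahn's algorithm); the list grows while iterating.
    indegree = {u: len(parents[u]) for u in nodes}
    order = [u for u in nodes if indegree[u] == 0]
    for u in order:
        for v in children[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                order.append(v)

    alpha = {}
    for u in order:          # forward: paths from the top node down to u
        alpha[u] = 1.0 if not parents[u] else \
            sum(alpha[p] for p in parents[u]) * prob[u]
    beta = {}
    for u in reversed(order):  # backward: paths from u down to the bottom
        beta[u] = 1.0 if not children[u] else \
            sum(beta[c] * prob[c] for c in children[u])
    return alpha, beta
```

For a node u labeled msw(i,·,v), the product `alpha[u] * beta[u]` is the total probability of the explanations passing through u, which is the quantity accumulated in the E-step; `alpha` of the bottom node is the likelihood of the goal.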

3.3 Learning HMMs 

Let us consider again the program in Section 2.3. Fig. 1 illustrates the support 
graph of an observed goal hmm([a,b,a]). Each node in the graph is labeled 
with msw(·,·,·), hmm([a,b,a]), or □. Given such a support graph $(U_t, E_t, l_t)$, 
learn-gEM and forward-backward take $O(|E_t|)$ calculations in each repeat 
loop. So, letting $N$ be the number of states and $L$ the length of a string, the 
graphical EM algorithm takes $O(N^2 L)$ calculations for each parameter update. 
As described in jSj, the Baum-Welch algorithm also takes $O(N^2 L)$, so it is 
concluded that the graphical EM algorithm can simulate the Baum-Welch algorithm 
via the HMM written as a PRISM program. 
[Figure: support graph whose nodes include msw(out(s0),1,a), msw(out(s1),1,a), msw(out(s0),2,b), msw(out(s1),2,b), msw(out(s0),3,a), msw(out(s1),3,a) and the bottom node □.] 

Fig. 1. The support graph with the goal hmm([a,b,a]). 



4 Related Works 

So far, many probabilistic extensions of logic programs have been proposed. We 
here mention some of related works briefly (In iiil . we have also mentioned other 
related works such as EE). Muggleton’s stochastic logic programs (SLPs) 0 
combine probabilities with first-order logic programs, but no mention is made 
about the parameter learning. In Riezler’s probabilistic constraint logic pro- 
gramming inj and Cussens’s loglinear models using SLPs 0, the probability 
distribution (a loglinear distribution) is defined on the set $\Omega$ of proof trees. It 
is of the form $P(\omega) = Z^{-1} \exp(\sum_i \lambda_i f_i(\omega))$ for $\omega \in \Omega$, where each $f_i$ is a 
feature of a proof tree, $\lambda_i$ is its parameter and $Z$ is a normalizing constant. The problem in their framework 
is that the adoption of $\Omega$ as a sample space could be an obstacle to the logical 
or procedural treatment of negation. Besides, the calculation of the normalizing 
constant $Z$ is generally intractable, and Riezler proposed no polynomial learning 
algorithm for a specialized class of models such as HMMs. In contrast, in PRISM 
modeling, we can enjoy efficient learning as shown in this paper, though users 
must write programs so that the probabilities of all observable ground atoms sum 
up to one, which eliminates the need for normalization. There seems to be a 
trade-off between the efficiency of learning and the burden of programming on 
the user. 

5 Discussion 

We have presented a new framework for the modeling language PRISM, in which 
efficient parameter learning is achieved for a certain class of symbolic-statistical 
models. We showed that, if the objective model can be represented by a PRISM 
program satisfying two conditions (the disjointness and the uniqueness condi- 
tion), then we can construct the corresponding PRISM* program for which an 
efficient learning is possible. A user has only to write programs so that these two 
conditions are met. In reality, we can say that they are not too restrictive in the 
sense that PRISM programs which represent widely-known symbolic-statistical 
models like HMMs, BNs, and PCFGs satisfy these conditions. 

We then presented a graph-based EM algorithm (the graphical EM algorithm)
which runs on a graphical data structure (a support graph) of a PRISM
program for each observation. We have roughly shown that, for a given PRISM
program, the EM algorithm for the (automatically constructed) PRISM* program
and the graphical EM, given appropriate support graphs, yield the same
estimate. This justifies using the graphical EM instead of the learning algorithm
used so far for PRISM programs, which is general but computationally
inefficient. We also showed that, if appropriate support graphs are given for an
HMM, the graphical EM has the same computational complexity as the
Baum-Welch algorithm. This implies that the graphical EM is a generalization of
Baum-Welch. Similarly, it is straightforward to give an algorithm which finds
the most likely minimal support set (or the most likely explanation) for an
observation. That algorithm, omitted due to space limitations, can be considered
a generalization of the Viterbi algorithm, which was also developed for HMMs.

There remains a lot to be done for our new framework. Generating such a
support graph is still a problem, so we are currently planning to adopt search
techniques such as tabulation. We also need to study the computational
relationship between the graphical EM and other specialized EM algorithms, e.g.,
the Inside-Outside algorithm for PCFGs.

A Derivation of learn-PRISM and learn-PRISM*

A.1 learn-PRISM

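To derive learn-PRISM, we transform the Q function. Hereafter we take $0 \log 0 = 0$,
and abbreviate $P_{DB}(\Sigma_{DB} = x, \ldots)$ as $P_{DB}(x, \ldots)$ and $P_{DB}(G_t = 1 \mid \theta)$ as $P_t$.

\begin{align*}
Q(\theta_{new}, \theta)
 &= \sum_{t} \frac{1}{P_t} \sum_{x} P_{DB}(x, G_t = 1 \mid \theta) \log P_{DB}(x, G_t = 1 \mid \theta_{new}) \\
 &= \sum_{t} \frac{1}{P_t} \sum_{S \in \psi_{DB}(G_t)} \sum_{z} P_{DB}(S = 1, S^{-} = 0, S_{rest} = z, G_t = 1 \mid \theta) \cdot \\
 &\qquad \log P_{DB}(S = 1, S^{-} = 0, S_{rest} = z, G_t = 1 \mid \theta_{new}) \\
 &= \sum_{t} \frac{1}{P_t} \sum_{S} \sum_{z} P_F(S = 1, S_{rest} = z \mid \theta) \log P_F(S = 1, S_{rest} = z \mid \theta_{new}).
\end{align*}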



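In addition, under $P_F$, for each $S \in \psi_{DB}(G_t)$ ($1 \le t \le T$), $S$
and $S_{rest}$ are independent of each other. So, we have

\[
Q(\theta_{new}, \theta) = \sum_{t} \frac{1}{P_t} \sum_{S} P_F(S = 1 \mid \theta) \cdot
\Bigl\{ \log P_F(S = 1 \mid \theta_{new}) + \sum_{z} P_F(S_{rest} = z \mid \theta) \log P_F(S_{rest} = z \mid \theta_{new}) \Bigr\}. \tag{6}
\]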

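Let $|S_{rest} = z|_{i,v}$ be the number of equations msw($i$,·,$v$) = 1 which occur in
$S_{rest} = z$. Then, similarly to the corresponding expression for $P_F(S = 1 \mid \theta_{new})$,

\[ P_F(S_{rest} = z \mid \theta_{new}) = \prod_{i,v} \theta'_i(v)^{|S_{rest} = z|_{i,v}} \]

holds. Substituting both expressions into Eq. (6), the content of $\{\cdot\}$ in Eq. (6) becomes:

\begin{align*}
& \sum_{i,v} \sigma_{i,v}(S) \log \theta'_i(v) + \sum_{z} P_F(S_{rest} = z \mid \theta) \sum_{i,v} |S_{rest} = z|_{i,v} \log \theta'_i(v) \\
&= \sum_{i,v} \Bigl( \sigma_{i,v}(S) + \sum_{z} P_F(S_{rest} = z \mid \theta)\, |S_{rest} = z|_{i,v} \Bigr) \log \theta'_i(v). \tag{7}
\end{align*}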

Let $S_{rest}^{i,v}$ denote the set of msw($i$,·,$v$)s included in $S_{rest}$ ($S_{rest}$ is thus divided into
the disjoint sets $S_{rest}^{i,v}$ ($i \in I$, $v \in V_i$)), and let $z_{i,v}$ denote the realization of $S_{rest}^{i,v}$.
Then, from the fact that $|S_{rest}^{i',v'} = z_{i',v'}|_{i,v} = 0$ if $i \neq i'$ or $v \neq v'$, and that
distinct $S_{rest}^{i,v}$ ($i \in I$, $v \in V_i$) are independent of each other, the following holds:

\[
\sum_{z} P_F(S_{rest} = z \mid \theta)\, |S_{rest} = z|_{i,v} = \sum_{z_{i,v}} P_F(S_{rest}^{i,v} = z_{i,v} \mid \theta)\, |S_{rest}^{i,v} = z_{i,v}|_{i,v}.
\]

The right-hand side can be considered as the expected number of times switch
$i$ takes the value $v$ when switch $i$ is sampled $\gamma_i(S)$ times under the parameter
$\theta_i$, so it equals $\gamma_i(S)\theta_i(v)$. That is, $\sum_{z} P_F(S_{rest} = z \mid \theta)\, |S_{rest} = z|_{i,v} = \gamma_i(S)\theta_i(v)$
holds. Here we get the following inequality:

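\begin{align*}
Q(\theta_{new}, \theta)
 &= \sum_{t} \frac{1}{P_t} \sum_{S} P_F(S = 1 \mid \theta) \sum_{i,v} \bigl( \sigma_{i,v}(S) + \gamma_i(S)\theta_i(v) \bigr) \log \theta'_i(v) \\
 &= \sum_{i,v} \Bigl( \sum_{t} \frac{1}{P_t} \sum_{S} P_F(S = 1 \mid \theta) \bigl( \sigma_{i,v}(S) + \gamma_i(S)\theta_i(v) \bigr) \Bigr) \log \theta'_i(v) \\
 &= \sum_{i \in I, v \in V_i} ON_i(v, \theta) \log \theta'_i(v)
 \le \sum_{i \in I, v \in V_i} ON_i(v, \theta) \log \frac{ON_i(v, \theta)}{\sum_{v' \in V_i} ON_i(v', \theta)},
\end{align*}

where $ON_i(v, \theta) = \sum_{t=1}^{T} \frac{1}{P_t} \sum_{S \in \psi_{DB}(G_t)} P_F(S = 1 \mid \theta) \{ \sigma_{i,v}(S) + \gamma_i(S)\theta_i(v) \}$.
Replacing $\theta$ and $\theta_{new}$ by $\theta^{(m)}$ and $\theta^{(m+1)}$ respectively, the update

\[ \theta_i^{(m+1)}(v) := \frac{ON_i(v, \theta^{(m)})}{\sum_{v' \in V_i} ON_i(v', \theta^{(m)})} \]

yields $Q(\theta^{(m+1)}, \theta^{(m)}) \ge Q(\theta^{(m)}, \theta^{(m)})$, so
this update of $\theta$ realizes MLE of the PRISM program. Here it is obvious that
learn-PRISM realizes the above procedure. □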



A.2 learn-PRISM* 

Similarly to the derivation of learn-PRISM, we transform the Q* function.
Due to space limitations, we abbreviate $S_{rest}$ as $S_r$, and note that
$P_{DB^*}(G_t = 1 \mid \theta) = P_{DB}(G_t = 1 \mid \theta)$ $(= P_t)$.

Q* {9 new, 6) 

~ = l X/a: Rob* i^DB Gt = V\6) log Pdb'» {E*Qg=X, Gt = V\6new) 

= ELi ^ Esmg.) (5= i’ =0’ = 0. 5. = 6, = i, Gt = m ■ 

log Pdb* {s^ i, S'- =6, s* =6, s, =6, s; = i, Gt = i\9new) 

= ELi Es Pp* (S= i, S- = 6, s* = 6, s. = 6, s; = i|0). 

iogPF-(s=i,s-=6,s*=6,s.=6,s; = i|6i„e») 

= Y.tl^Xs^i,g,iG,)P^iS=i\9)\ogPB{S=i\9new). (8) 



Note that the right-hand side of Eq. (8) equals that of Eq. (6) with the term
$\sum_{z} P_F(S_{rest} = z \mid \theta) \log P_F(S_{rest} = z \mid \theta_{new})$ set to 0. As described in the previous
section, this term results in $\gamma_i(S)\theta_i(v)$, so the obtained algorithm, learn-PRISM*,
is just learn-PRISM with $\gamma_i(S)\theta_i(v) = 0$. □



B Equivalence of learn-PRISM* and learn-gEM 



In this section, we present an outline of the proof of the equivalence of learn-PRISM*
and learn-gEM. We first consider the support graph $(U_t, E_t, l_t)$ for the goal $G_t$.
After executing forward-backward($U_t$, $E_t$, $l_t$, $\theta$), the following hold for $u \in U_t$:



a(u) = \ 

a(u)P(u) = Y.nen,uenPp('^‘ = ^W’ 



,9) 



if M = Pp 

if M G child[uP) 

otherwise, 



if M or u G parent{ul°*) 

otherwise, 



(9) 

(10) 

( 11 ) 



where nf*'^*\u) ,u), and 

The proof is done by induction on the structure of {Ut, Et,lt). 
Then, the return value of forward-backward($U_t$, $E_t$, $l_t$, $\theta$) is equal to the likelihood
$P_{DB}(G_t = 1 \mid \theta)$ because, from the definition of support graphs and $\alpha(u)$ above,

=1 

= E.en = i|^) = i|^) = PuBiGt = 1\9). 

To show that learn-PRISM and learn-gEM starting from the same initial parameters
yield the same estimate, it is enough to show that, just before parameter
updating, the value of $ON_i^t(v)$ in learn-PRISM and that of $on_i^t(v)$ in learn-gEM are
equal for $1 \le t \le T$, $i \in I$, $v \in V_i$.

OTliiv) = E a{u)P{u) ^from while loop of learn-gEM^ 
uec/t.it(ii)=nisw(i,- ,v) 

= E E 

ueUt,h(n)=mS^G ,■ ,V) 7rGi7,«evr 

^ Pr(n‘ = m = ^ ^ PF(7r‘^m 

ueUt,h('u)=msw(i,- ,v) ,-iren,ueTv if(u)=msw(i ,■ ,v) ,usn 

= ^Pf.(V = i| 0 ) 1 = ^ PF(5'=i|0)<7i,„(S) = Oiv‘(n). 

!t(u)=msw(i, • ,n) .ttSTT setfinslGt) 

' V ' 



□ 
Abstract. This paper describes a parallel algorithm and its implementation
for a hypothesis space search in Inductive Logic Programming
(ILP). A typical ILP system, Progol, regards induction as a search problem
for finding a hypothesis, and an efficient search algorithm is used
to find the optimal hypothesis. In this paper, we formalize the ILP task
as a generalized branch-and-bound search and propose three methods
of parallel execution for the optimal search. These methods are implemented
in KL1, a parallel logic programming language, and are analyzed
for execution speed and load balancing. An experiment on a benchmark
test set was conducted using a shared-memory parallel machine to evaluate
the performance of the hypothesis search according to the number of
processors. The result demonstrates that the statistics obtained coincide
with the expected degree of parallelism.



1 Introduction 

While Inductive Logic Programming (ILP) has greater expressive power as a
relational learning framework than propositional learners do, learning in ILP
is relatively slow due to the trade-off between expressiveness and efficiency.
A typical ILP system, Progol, formalizes induction as a hypothesis space search
problem to obtain optimal hypotheses with maximum compression, and provides
an efficient implementation of the optimal search. However, the number of
hypotheses generated during the search is still exponential; there is no effective
method for constraining a general hypothesis search. A new implementation
scheme from the computer science literature is worth considering to realize
efficient ILP systems.

We attempted a parallel execution of an ILP system based on the framework
of parallel logic programming to overcome the ILP efficiency problem. We first
propose a hypothesis search method formalized as a generalized branch-and-bound
search and describe a parallel algorithm for the search, in which
three types of parallel processes are employed. We then
implement the algorithm using a parallel logic programming language, KL1; the
KL1 program is compiled to the corresponding C program by KLIC. We also
show how load balancing is accomplished for the three processes using KL1, and
we estimate the degree of parallelism from the performance statistics in a
single-processor environment. Finally, to justify this estimation, we demonstrate that
the experimental result on a benchmark test set coincides with the expected
degree of parallelism, using a shared-memory machine with six processors. Though
recent attempts to scale up ILP consist of statistics-based sampling
and connections to database technology, our attempt provides an alternative
means, using the parallel computation framework in logic programming.

2 A Hypothesis Search in ILP 

To explain the hypothesis space search problem simply, we suppose that a set
of positive examples ($E^+$), a set of negative examples ($E^-$), and background
knowledge ($B$) are collections of ground terms. An ILP system sequentially
produces definite clauses $h_i$ that satisfy the following conditions:

\[ h_1 \wedge \ldots \wedge h_n \wedge B \models E^+ \]
\[ h_1 \wedge \ldots \wedge h_n \wedge B \wedge E^- \not\models \Box \]

$E^-$ may be replaced by a subset of all negative examples to compensate for
noise. In the following, $h_i$ is called a hypothesis.

An ILP system, Progol, constructs a bounded hypothesis space using the input-output
modes of predicate arguments and the depth of variable connectivity.
Such a space forms a lattice, the bottom of which is the most specific clause
that explains the example $e \in E^+$. For hypotheses $h_i$ and $h_j$ in the lattice,
$h_i$ is an upper-level hypothesis of $h_j$ if a substitution $\theta$ exists such that $h_i\theta \subseteq h_j$.
In this case, $h_j$ is a more specific clause of $h_i$. We denote by Refine($h$) the set
of more specific clauses of $h$.

We examined a learning task for classifying e-mail priorities as an example
of a hypothesis space. In this task, the priority of a received e-mail is either high,
normal, or low, and the fact priority(A, X) indicates that the mail A is classified
into X. Figure 1 shows a hypothesis space in this task. Let the priority
of the first mail m1 be normal. By replacing the constant m1 with a variable (say
A), the most general clause becomes priority(A, normal), which means that all
mails are classified into normal. In contrast, the most specific clause (the bottom
clause) can be constructed by collecting all information associated with the mail
m1 from the background knowledge. Suppose that the background knowledge
includes the mail sender information denoted by receive_from_user, subject
information by subject, and words in the subject by in_word. In this case, the
most specific clause is obtained as shown in Fig. 1.

An ILP system finds a hypothesis in this space. The ideal hypothesis covers
as many positive examples as possible and as few negative examples as possible.
Let $p(h)$ and $n(h)$ be the numbers of positive and negative examples covered by
hypothesis $h$. The number of literals in $h$ is denoted by $c(h)$. We define

\[ g(h) = p(h) - c(h), \]
\[ f(h) = g(h) - n(h), \]

Fig. 1. A hypothesis space. An enclosed clause is a hypothesis. The pair associated
with a hypothesis gives the numbers of positive and negative examples that the
hypothesis covers.



where $g(h)$ indicates the generality of $h$ and $f(h)$ indicates the compression
measure, as used in Progol. $g(h)$ decreases monotonically in a top-down
hypothesis search, while $f(h)$ is non-monotonic. Progol finds a hypothesis $h$ with
maximum compression in which $n(h) = 0$, and there is no hypothesis $h'$ such
that $f(h) > g(h')$. However, noise must be considered. Suppose that a small
number of negative examples ($n^*$) is allowed for noise. Even if Progol finds a
hypothesis $h$ that satisfies $n(h) \le n^*$, Progol must continue searching for the optimal
hypothesis when hypotheses more specific than $h$ must be explored.

In contrast, we treat the permissible number of negative examples covered
by a hypothesis as a constraint in a search problem. Moreover, we adopt
the generality of $h$ as the objective function, yielding a monotonic search. The
compression measure $f(h)$ is used to select a hypothesis to be expanded for
further hypothesis exploration. Thus, the compression measure is not a criterion
for optimality. It is used for search control, and we extend it to the following
weighting function:

\[ f(h) = (1 - \lambda)\, g(h) - \lambda\, n(h) \]

where $\lambda$ controls the order of hypothesis exploration. It is easier to satisfy the
above search constraint if $\lambda$ is larger. A normal branch-and-bound search is
realized for $\lambda = 0$. The next section provides a parallel algorithm based on this
weighted branch-and-bound search.
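
The effect of the weight can be illustrated with a small sketch. It assumes that p(h), n(h), and c(h) are already known for each candidate; the candidate names and counts below are hypothetical and are not taken from the experiments.

```python
def generality(p, c):
    """g(h) = p(h) - c(h): covered positives minus number of literals."""
    return p - c


def compression(p, n, c):
    """f(h) = g(h) - n(h): the compression measure."""
    return generality(p, c) - n


def weighted_score(p, n, c, lam):
    """Weighted selection function (1 - lam) * g(h) - lam * n(h)."""
    return (1.0 - lam) * generality(p, c) - lam * n


# Hypothetical candidates: (name, positives covered, negatives covered, literals).
candidates = [("h1", 10, 5, 1), ("h2", 8, 1, 2), ("h3", 5, 0, 3)]


def pick(lam):
    """Return the name of the candidate that would be expanded next."""
    best = max(candidates, key=lambda h: weighted_score(h[1], h[2], h[3], lam))
    return best[0]
```

With these counts, a weight of 0 prefers the most general candidate, while larger weights shift the choice toward candidates that cover fewer negative examples.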

3 Parallel Algorithm

Parallel execution of the above branch-and-bound search can be performed by 
parallelizing the following search processes: 

1. Parallel hypothesis exploration for independent hypotheses 

There are independent hypotheses for each priority in a classification prob- 
lem such as the e-mail priority classification. The degree of parallelism for 
an independent hypothesis search increases significantly if there are many 
categories to be classified. 

2. Parallel hypothesis branching 

A hypothesis is refined by more specific hypotheses in a top-down hypothesis 
search. Thus, each specific hypothesis can be explored in parallel. If the 
number of literals in the bottom clause is very large, the degree of parallelism 
becomes high. 

3. Parallel counting of $f(h)$ and $g(h)$

The computation cost for $f(h)$ and $g(h)$ depends on the cardinality of $E^+$
and $E^-$. Each example can be independently tested to determine if it can
be subsumed by a hypothesis ($h$). If there are greater numbers of elements
in $E^+$ and $E^-$, we expect a higher degree of parallelism.

We use the following example to show the three parallel processes within 
the search algorithm. Local variables are inserted in the arguments of the pro- 
cedures, and the other variables are global. The notation “in parallel” indicates 
the parallel execution of statements. 

Suppose that there are n categories for the classification problem and each
learning target is denoted by SubConcept[i]. P[i] and N[i] are the sets of positive
and negative examples for each SubConcept[i]. The top-level learning procedure
is described as follows:

Induce

1 H ← ∅
2 for each SubConcept[i], in parallel
3     do H ← H ∪ CoverSet(P[i], ∅, i)

The variable H is instantiated to a set of hypotheses. The function CoverSet 
in Line 3 produces multiple hypotheses for each SubConcept[i]. The function is 
defined as follows: 

CoverSet(E, Hsub, i)

4 while E ≠ ∅
5     do let e be the first element of E
6        Hsub ← Hsub ∪ {Branch(e, {□}, e, i)}
7        E ← E − {e′ ∈ E | B ∧ Hsub ⊨ e′}
8 return Hsub

After selecting an example e ∈ E in Line 5, a set of hypotheses is produced by
calling the procedure Branch in Line 6. In Line 7, the positive examples covered by
the produced hypotheses are deleted from E.
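
The covering loop can be sketched sequentially as follows. This is only an illustration under stated assumptions: `find_hypothesis` stands in for the call to Branch and `covers` for the entailment test, both of which are hypothetical callables supplied by the caller, and the selected example is always dropped so that the loop terminates even if the returned hypothesis fails to cover it.

```python
def cover_set(examples, find_hypothesis, covers):
    """Sequential sketch of the CoverSet loop.

    find_hypothesis(e) -> hypothesis   (hypothetical stand-in for Branch)
    covers(h, e) -> bool               (hypothetical stand-in for entailment)
    """
    remaining = list(examples)
    hypotheses = []
    while remaining:
        e = remaining[0]                  # Line 5: take the first example
        h = find_hypothesis(e)            # Line 6: search for a hypothesis
        hypotheses.append(h)
        # Line 7: delete the covered examples (e itself is always removed).
        remaining = [x for x in remaining[1:] if not covers(h, x)]
    return hypotheses


# Toy usage: a "hypothesis" is a parity and covers numbers with that parity.
found = cover_set([1, 2, 3, 4, 5], lambda e: e % 2, lambda h, x: x % 2 == h)
```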

The procedure Branch realizes a branch-and-bound search. In this case, the
branching process generates more specific hypotheses specified by the function
Refine(h), and the bounding process deletes irrelevant hypotheses. Let Active
be a set of non-branched and non-terminated hypotheses, and Incumbent be the h
with maximum g(h) such that n(h) ≤ n*. These parameters are passed as
the second and third arguments of the procedure Branch, and their initial values
are {□} and e. The definition of Branch is as follows:

Branch(e, Active, Incumbent, i)

9  while Active ≠ ∅
10     do select h from Active using f(h)
11        Active ← Active − {h}
12        for each d ∈ Refine(h), in parallel
13            do Bound(d, i)
14 return Incumbent

In Line 10, a hypothesis h is selected from Active based on the function
f(h). The bounding process is invoked in parallel for each more specific hypothesis
d ∈ Refine(h).

The procedure Bound determines whether a hypothesis h should be further
refined or not. This procedure calculates the values of f(h) and g(h). The current
incumbent might be updated if n(h) ≤ n*. Otherwise, h is put into Active. Thus,
the procedure can be defined as follows:

Bound(d, i)

15 if Neg(d, 0, i) ≤ n*
16     then {updating the incumbent}
17          if Neg(Incumbent, 0, i) > Neg(d, 0, i)
18              then Incumbent ← d
19     else
20          if Pos(d, 0, i) > Pos(Incumbent, 0, i)
21              then Active ← Active − {d}

Such a hypothesis is meaningless if the condition in Line 20 is satisfied, and thus
the hypothesis is removed from Active. The values of f(h) and g(h) are calculated
within the procedures Pos and Neg. This can be performed for each positive and
negative example of SubConcept[i]. Thus, the procedures are defined as follows:

Pos(d, f, i)

22 for each e ∈ P[i], in parallel
23     do if B ∧ Hsub ⊨ e
24         then f ← f + 1

Neg(d, g, i)

25 for each e ∈ N[i], in parallel
26     do if B ∧ Hsub ⊨ e
27         then g ← g + 1

Fig. 2. Three parallel processes

Figure 2 illustrates how the above parallel algorithm works during a hypothesis
search. The procedure Induce has the largest granularity of the parallel
processes, while the procedures Pos and Neg have the smallest. The degree of
parallelism for each process is influenced by such granularity, as shown in Section 5.
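
The finest-grained process, counting covered examples, can be sketched as below. This is a simplified illustration and not the KL1 implementation: the coverage test is a propositional subset check standing in for the subsumption test, and the feature names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def covers(hypothesis, example):
    # Simplifying assumption: a hypothesis covers an example when all of its
    # conditions appear among the example's features.
    return hypothesis <= example


def count_covered(hypothesis, examples, workers=4):
    """Test every example independently and count the covered ones."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda e: covers(hypothesis, e), examples)
        return sum(1 for covered in results if covered)


# Hypothetical e-mail examples described by sets of features.
positives = [frozenset({"urgent", "boss"}), frozenset({"urgent"}), frozenset({"boss"})]
negatives = [frozenset({"spam"}), frozenset({"urgent", "spam"})]
h = frozenset({"urgent"})
p = count_covered(h, positives)
n = count_covered(h, negatives)
```

Because each test is independent, the work divides evenly, but each unit is so small that dispatch overhead can dominate, which is the granularity concern raised for Pos and Neg.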

4 Implementing the Search Algorithm in KL1

The proposed parallel search algorithm is implemented in a parallel logic
programming language, KL1. KLIC compiles a KL1 program into the
corresponding C program. A KL1 program consists of guarded
Horn clauses of the following form:

H :- G1, ..., Gm | B1, ..., Bn

where H and Bj are atoms, and Gi is a built-in predicate call. H, Gi, and Bj are
called head, guard, and body. Although head matching and guard computations
are performed sequentially, each body goal is executed in parallel. A
processor ID can be specified for each body goal so that it is executed on that processor.
In KL1, a processor is annotated within a goal by

Bj@node(k)

where k is the ID of the processor.

This processor assignment enables a parallel algorithm described by

for each A ∈ S, in parallel
    do p(A)

to be easily translated into the following KL1 program:

1  for_each(S, * ) :-
2      current_node(Node, Nodes),
3      do(S, Node, Nodes, * ).
4  do( * ).
5  do([A|Rest], Node, Nodes, * ) :-
6      Node =\= 0,
7      NextNode := (Node+1) mod Nodes |
8      p(A, * )@node(Node),
9      do(Rest, NextNode, Nodes, * ).
10 do([A|Rest], 0, Nodes, * ) :-
11     p(A, * )@lower_priority,
12     do(Rest, 1, Nodes, * ).

where * represents an abbreviation of the n-ary arguments. The predicate
current_node in Line 2 is a built-in predicate supported by KLIC; the predicate
returns the processor ID on which the program is executing in the first argument,
and the number of available processors in the second. A processor ID is a non-negative
integer starting from zero. A special case arises when the above program is
running on processor zero. Lines 10-12 handle this special
case, and the annotation lower_priority specifies that the execution priority of
the associated goal p(A, * ) in Line 11 is lower than that of the other goals. Thus, the
goal in Line 12 is executed with higher priority, and goals can be
dispatched to other processors before the goal in Line 11 is executed.
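
The assignment produced by the do clauses can be traced with a short sketch. It mirrors only the node-selection logic of Lines 5-12; the function name and the priority labels are illustrative, and at least two nodes are assumed.

```python
def dispatch(goals, node, nodes):
    """Return (goal, node, priority) triples following the do clauses.

    node  -- processor ID on which the loop starts
    nodes -- number of available processors (assumed to be at least 2)
    """
    plan = []
    for goal in goals:
        if node != 0:
            # Lines 5-9: send the goal to the current node, then advance.
            plan.append((goal, node, "normal"))
            node = (node + 1) % nodes
        else:
            # Lines 10-12: keep the goal on node 0 at lower priority and
            # continue with node 1.
            plan.append((goal, 0, "lower"))
            node = 1
    return plan


plan = dispatch(["a", "b", "c", "d", "e"], 1, 3)
```

Goals are thus spread round-robin over the processors, and the goal that stays on processor zero is deferred so that dispatching the rest is not delayed.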

5 Analysis 

The logical degrees of parallelism for the above cases of parallel execution are
summarized as follows:

Induce     The number of SubConcepts
Branch     The average of |Refine(h)|
Pos&Neg    |E+| + |E−|

We compared the above numbers with the actual number of parallel processors.
We used a shared-memory parallel machine with six processors (Enterprise 3000,
developed by Sun Microsystems). The degree of parallelism in the third case is
much larger. The second case depends approximately on the number of literals
in the bottom clause; thus the degree of parallelism is determined by the amount
of background knowledge. The first case indicates the number of categories to be
classified, and its degree of parallelism is relatively small. The parallel execution
of Pos&Neg is uniformly distributed, while Induce and Branch strongly depend
on the given ILP problem. It is particularly hard for Induce to balance the load
across processors.

We then considered the granularity of the distributed computation. Obviously,
the cost of dispatching goals to other processors cannot be ignored. If a fine-grained
goal is dispatched, the dispatching time becomes overhead in the total
execution. This case may be found in Pos&Neg.



6 Experiment 

The objective of the experiment is to classify e-mail, as described in Section
2. The experiment was conducted on a typical dataset. Table 1
shows the statistics of our ILP system working on a single-processor machine.
Case 3 has the largest number of examples and Case 2 has the largest amount of background
knowledge. The item Total represents the learning time. The items Induce and
Pos&Neg are the average execution times for one process. Since Induce is executed
for each classification category, we show its minimum and maximum execution
time for one process. The item Pos&Neg is the average cost of a subsumption
test, namely

\[ B \wedge H_{sub} \models e. \]

The value of Pos&Neg is greater in Case 2, because most of the hypotheses
produced have many literals in their conditional parts. The item Branch is the
average number of literals in the bottom clause. The greater amount of background
knowledge in Case 2 yields a longer bottom clause.



Table 1. Statistics of induction on a single processor machine

Case   |Pos|+|Neg|   Total (sec.)   Induce (sec.)   Branch   Pos&Neg (μsec.)
1      142           9.8            0.1~3.3         6.3      60.7
2      60            62             0.3~44          27.6     154.7
3      718           182            0.1~95          6.0      64.5


Figure 3 shows the effect of the number of processors on the induction
time. Obviously, the case of Induce cannot be uniformly dispatched.
In particular, no speed-up was obtained for Case 2 after increasing
the number of processors. Branch showed good dispatching for Case 2, because
the bottom clause has many literals and the branching factor was greater.
In contrast, Pos&Neg is independent of the problem, and therefore the load
is well-balanced in all cases. The result shown in Fig. 3 coincided with
the theoretical analysis described in the previous section. This implies that we
can estimate the effect of the number of processors by measuring the execution
statistics on a single-processor machine.

Fig. 3. The effect of the number of processors. The straight line indicates the
ideal performance. Note that super-linear effects can be found in Case 2. This is
due to garbage collection.

7 Related Work 

Parallelizing an ILP engine is important for the efficiency and scaling-up of
ILP. In particular, a large amount of data must be processed efficiently in data mining
applications of ILP. Recent attempts to resolve this problem have
focused on a sampling method based on statistics and on information decomposition
by checking the independence among the examples and background
knowledge. Since these are ILP-specific methods, they do not compete
with our method; rather, they can enhance the effect of parallel execution for
a hypothesis search. A first step toward parallel ILP has been reported, but
performance issues were not discussed there. A parallel version of FOIL has also
been attempted. In contrast, our method proposes three types of parallel execution, and
indicates the performance improvements for different numbers of parallel
processors.

8 Conclusions 

This paper presented an efficient hypothesis search method for ILP using parallel
logic programming. Our experiment demonstrated that the theoretical analysis
and the resulting performance were in accord. This implies that the degree of
parallelism on multiple-processor machines is predictable from the execution
statistics on a single-processor machine. Since the analysis and the experiment
in this paper focused on execution time, future work will include computation
space and linkage to database systems.

Abstract. This paper shows that a connectionist law discovery method
called RF6 can discover a law in the form of a set of nominally conditioned
polynomials from data containing both nominal and numeric
values. RF6 learns a compound of nominally conditioned polynomials by
using a single neural network, selects the best one among candidate
networks, and decomposes the selected network into a set of rules. Here
a rule means a nominally conditioned polynomial. Experiments showed
that the proposed method works well in discovering such a law even from
data containing irrelevant variables and a small amount of noise.



1 Introduction 

The discovery of a numeric law such as Kepler's third law $T = kr^{3/2}$ from data
is the central part of scientific discovery systems. After the pioneering work
of the BACON systems, several methods have been proposed.
The basic search strategy employed by these methods is much the same: two
variables are recursively combined into a new variable by using multiplication,
division, or some predefined prototype functions. These existing methods suffer
from combinatorial explosion in search, the need to prepare appropriate prototype
functions, and a lack of robustness.

A connectionist approach has great potential to solve these problems, and a
connectionist approach called RF5 has been proposed. RF5 discovers
a single numeric law from data containing only numeric values; however,
in many real fields, data usually contain nominal values as well. For example,
Coulomb's law $F = q_1 q_2 / 4\pi\epsilon r^2$ depends on $\epsilon$, the permittivity of the surrounding
medium; i.e., if substance is "water", then $F = 8897.352\, q_1 q_2 / r^2$, if substance is
"air", then $F = 111.280\, q_1 q_2 / r^2$, and so on. Thus, Coulomb's law is nominally
conditioned. This type of law discovery problem was addressed by BACON.4,
ABACUS, IDS, and so on. Although each has its own advantages, as far as
the discovery of numeric laws is concerned all of them have the drawbacks stated above.

This paper shows that a connectionist law discovery method, RF6, the successor
of RF5, can discover a set of nominally conditioned polynomials from data
containing both nominal and numeric values. Section 2 explains RF6, i.e., its
basic framework, the numeric representation of nominal conditions, connectionist
problem solving, the criterion for network selection, and the procedure for
decomposing a network into a set of rules. Section 3 evaluates RF6 by using two data sets.

2 RF6 Method



2.1 Basic Framework 



Let $(q_1, \ldots, q_{K_1}, x_1, \ldots, x_{K_2}, y)$ be a vector of variables describing each example,
where $q_k$ is a nominal explanatory variable, $x_k$ is a numeric explanatory^1 variable
and $y$ is a numeric target variable. Let $L_k$ be the number of different categories
in $q_k$. Here, by adding extra categories if necessary, we can assume without loss of
generality that $q_k$ matches exactly one category. Thus, for each $q_k$
we can introduce a dummy variable $q_{kl}$ defined by

\[
q_{kl} = \begin{cases} 1 & \text{if } q_k \text{ matches the } l\text{-th category}, \\ 0 & \text{otherwise}, \end{cases} \tag{1}
\]

where $l = 1, \ldots, L_k$, and $L_k$ is the number of distinct categories appearing in $q_k$.
In this paper, as the true model governing data, we consider a set of multiple
rules^2, each of which has the following conjunctive nominal condition:

\[
\text{if } \bigwedge_{q_{kl} \in Q^{(r)}} q_{kl} = 1 \ \text{ then } \ y = f(x; W^{(r)}), \qquad r = 1, \ldots, R, \tag{2}
\]



where $Q^{(r)}$ denotes a set of dummy variables corresponding to the $r$-th nominal
condition, and $R$ is an unknown integer corresponding to the number of rules.
To respond uniquely to any situation, we assume that the nominal conditions defined
by $Q^{(r)}$, $r = 1, \ldots, R$, exclusively cover the whole space. On the other hand, as a
class of numeric formula $f$, we consider a generalized polynomial expressed by

\[
f(x; W^{(r)}) = w_0^{(r)} + \sum_{j=1}^{J^{(r)}} w_j^{(r)} \prod_{k=1}^{K_2} x_k^{w_{jk}^{(r)}}
 = w_0^{(r)} + \sum_{j=1}^{J^{(r)}} w_j^{(r)} \exp\Bigl( \sum_{k=1}^{K_2} w_{jk}^{(r)} \ln x_k \Bigr), \tag{3}
\]

where each parameter $w_j^{(r)}$ or $w_{jk}^{(r)}$ is an unknown real number, and $J^{(r)}$ is an
unknown integer corresponding to the number of terms. The last expression
requires $x_k > 0$. $W^{(r)}$ is a parameter vector constructed by arranging the parameters
$w_j^{(r)}$, $j = 0, \ldots, J^{(r)}$, and $w_{jk}^{(r)}$, $j = 1, \ldots, J^{(r)}$, $k = 1, \ldots, K_2$.

Note that Eq. (3) can be regarded as the feedforward computation of a three-layer
neural network; that is, $w_{jk}^{(r)}$ is a weight from input unit $k$ to hidden unit
$j$, $w_j^{(r)}$ is a weight from hidden unit $j$ to the output unit, the activation function
of each hidden unit is $\exp(s) = e^s$, and the number of hidden units is $J^{(r)}$.

^1 An explanatory variable means an independent variable.

^2 Here a rule "if A, then B" does not mean material implication as used in logic; it
simply means "when A holds, apply B", as used in a production system.

This type of hidden unit is called a product unit. Serious difficulties have been
reported when using standard BP to train networks containing these units.
Although some heuristic strategies such as multiple learning algorithms have
been proposed, their improvements over BP have been less than remarkable.
In order to obtain good results efficiently and consistently, this paper employs a
second-order learning algorithm called BPQ: adopting a quasi-Newton
method as a basic framework, the descent direction is calculated on the basis
of a partial BFGS update, and a reasonably accurate step-length is efficiently
calculated as the minimal point of a second-order approximation.
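
The product-unit view of the generalized polynomial can be checked with a small sketch. The function below is an illustration only: it evaluates the exponential form for one rule, with plain lists standing in for the weight vectors, and the example weights are hypothetical.

```python
import math


def generalized_polynomial(x, w0, w, W):
    """Evaluate w0 + sum_j w[j] * exp(sum_k W[j][k] * ln x[k]).

    x  -- numeric inputs, all of which must be positive
    w0 -- bias weight
    w  -- hidden-to-output weights, one per hidden (product) unit
    W  -- input-to-hidden weights, W[j][k] is the exponent of x[k] in term j
    """
    logs = [math.log(value) for value in x]
    return w0 + sum(
        wj * math.exp(sum(wjk * lx for wjk, lx in zip(Wj, logs)))
        for wj, Wj in zip(w, W)
    )


# Hypothetical law y = 2 + 3 * x1^2 * x2^(-1), evaluated at x = (2, 4).
y = generalized_polynomial([2.0, 4.0], 2.0, [3.0], [[2.0, -1.0]])
direct = 2.0 + 3.0 * (2.0 ** 2) / 4.0
```

Each hidden unit computes one product term, so the exponents of the discovered law can be read directly from the input-to-hidden weights.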

2.2 Numeric Representation of Nominal Conditions 

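Here we introduce a framework to learn nominal conditions from data. Consider
a function $g$ defined by

\[
g(q; V^{(r)}) = \exp\Bigl( \sum_{k=1}^{K_1} \sum_{l=1}^{L_k} v_{kl}^{(r)} q_{kl} \Bigr), \tag{4}
\]

where $V^{(r)}$ denotes a vector of parameters $v_{kl}^{(r)}$, $k = 1, \ldots, K_1$, $l = 1, \ldots, L_k$. For
a nominal condition defined by $Q^{(r)}$, we set the values of $V^{(r)}$ as follows:

\[
v_{kl}^{(r)} = \begin{cases}
0 & \text{if } q_{kl} \in Q^{(r)}, \\
-\beta & \text{if } q_{kl} \notin Q^{(r)},\ q_{kl'} \in Q^{(r)} \text{ for some } l' (\neq l), \\
0 & \text{if } q_{kl'} \notin Q^{(r)} \text{ for any } l',
\end{cases} \tag{5}
\]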

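where $\beta$ is a large positive constant. Then, $g(q; V^{(r)})$ is almost equivalent to
the truth value of the nominal condition defined by $Q^{(r)}$, i.e., $g(q; V^{(r)}) = 1$ if
$q$ satisfies the nominal condition; otherwise, $g(q; V^{(r)}) \le \exp(-\beta) \approx 0$ because
there exist some $k$, $l$ and $l'$ ($\neq l$) such that $q_{kl} = 1$, $q_{kl} \notin Q^{(r)}$ and $q_{kl'} \in Q^{(r)}$.
Thus, we can obtain the following relationship:

\[
\sum_{j=0}^{J^{(r)}} w_j^{(r)} \exp\Bigl( \sum_{k=1}^{K_1} \sum_{l=1}^{L_k} v_{kl}^{(r)} q_{kl} + \sum_{k=1}^{K_2} w_{jk}^{(r)} \ln x_k \Bigr)
\approx \begin{cases}
f(x; W^{(r)}) & \text{if } \bigwedge_{q_{kl} \in Q^{(r)}} q_{kl} = 1, \\
0 & \text{otherwise}.
\end{cases} \tag{6}
\]
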


Therefore, the following formula can closely approximate the final output value
of a set of multiple nominally conditioned polynomials defined by Eq. (2):

\[
F(q, x; \Psi) = \sum_{r=1}^{R} g(q; V^{(r)})\, f(x; W^{(r)}). \tag{7}
\]

Here $\Psi$ consists of all parameters $V^{(r)}$, $W^{(r)}$, $r = 1, \ldots, R$.
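
The way the exponential gate approximates a truth value can be seen in a small sketch. It is an illustration under assumptions: categories are indexed from zero, the value of beta is arbitrary, and the helper names are hypothetical.

```python
import math


def condition_weights(num_categories, allowed, beta=20.0):
    """Weights for one nominal variable following the setting of Eq. (5).

    allowed -- indices of the categories named in the condition; an empty
               set means the variable does not appear in the condition.
    """
    if not allowed:
        return [0.0] * num_categories
    return [0.0 if l in allowed else -beta for l in range(num_categories)]


def g(q, v):
    """exp of the weighted sum of dummy variables; q[k] is a one-hot row."""
    total = sum(vkl * qkl for vk, qk in zip(v, q) for vkl, qkl in zip(vk, qk))
    return math.exp(total)


# Condition: the first variable (3 categories) takes category 0;
# the second variable (2 categories) is unconstrained.
v = [condition_weights(3, {0}), condition_weights(2, set())]
satisfied = g([[1, 0, 0], [0, 1]], v)
violated = g([[0, 1, 0], [1, 0]], v)
```

An example that satisfies the condition passes the polynomial through unchanged, while a violating example has its contribution scaled by at most exp(-beta).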

2.3 Learning as a Single Network

Let $\{(q^{\mu}, x^{\mu}, y^{\mu}), \mu = 1, \ldots, N\}$ be a set of training examples, where $N$ is the
number of examples. By preparing an adequate number $J$ of hidden units, the
following formula can completely describe Eq. (7):

\[
y(q, x; \Theta) = w_0 + \sum_{j=1}^{J} w_j \exp\Bigl( \sum_{k=1}^{K_1} \sum_{l=1}^{L_k} v_{jkl}\, q_{kl} + \sum_{k=1}^{K_2} w_{jk} \ln x_k \Bigr), \tag{8}
\]

where $\Theta$ is a vector constructed by arranging all parameters $w_j$, $j = 0, \ldots, J$,
$v_{jkl}$, $j = 1, \ldots, J$, $k = 1, \ldots, K_1$, $l = 1, \ldots, L_k$, and $w_{jk}$, $j = 1, \ldots, J$, $k = 1, \ldots, K_2$.
Let $M$ be the order of $\Theta$. Here, note that Eq. (8) can also be regarded as the
feedforward computation of a three-layer neural network.

Eq. (8) is more general than Eq. (7) in the sense that the weights
$v_{jkl}$ in Eq. (8) can have different values for each hidden unit $j$. Here the value of
$v_{jkl}$ satisfying Eq. (7) is no longer limited to Eq. (5). In fact, any value setting
satisfying the following will do.

[ if Qkl ^ Q^"'\ 

^ 1 ^ ^ for some l'{^ 1), ( 9 ) 

[ if m' i for any If 

(r) frl 

where and 'jjfJ should satisfy the following equation: 



= Wj exp Y, + E ^jk ■ (10) 

\fe:39fciGQM k-.VqkitQ^’-^ ) 

Moreover, if we have the same term $\exp(\sum_k w_{jk} \ln x_k)$ in different rules $r$ and
$r'$ in Eq. (7), it can be merged into one hidden unit $j$ in Eq. (8). For these reasons,
the values of $v$ obtained after learning will be complicated.

As for the learning of neural networks, it is widely known that adding some
penalty term $\Omega(\Theta)$ to an error function $E(\Theta)$ can lead to significant improvements
in network generalization. Moreover, in order to improve the readability
of discovered laws, the values of irrelevant weights should become small enough;
some penalty terms are expected to work well for this purpose. In our own
experiments, the combination of the squared penalty (weight decay) term and
the second-order learning algorithm drastically improves the convergence
performance in comparison to the other combinations, at the same time bringing
about excellent generalization performance. Thus, a simple penalized target
function is given as below.



E{0) + \n{&) 



N / M 



( 11 ) 



where A is a penalty factor and dm G 0. 

However, the simple weight decay is known to be inconsistent with some scaling of variables. In fact, when all the weights are to be penalized, it can be shown that we need a distinct penalty factor for each weight to guarantee consistency with such scaling as $\tilde{x}_k = a_k x_k$ or $\tilde{y} = cy + d$. This may cause serious difficulty in learning. Fortunately, when the squared penalty is confined to the weights from the input layer to the hidden layer, we can show that even a simple penalty factor guarantees the consistency (see the Appendix). Hence, the discovery of laws in the form of multiple nominally conditioned polynomials can be defined as the following learning problem in neural networks: find the $\Theta$ that minimizes the objective function

$$\mathcal{E}(\Theta) = E(\Theta) + \lambda \left( \frac{1}{2} \sum_{j=1}^{J} \sum_{k=1}^{K_1} \sum_{l=1}^{L_k} v_{jkl}^2 + \frac{1}{2} \sum_{j=1}^{J} \sum_{k=1}^{K_2} w_{jk}^2 \right). \tag{12}$$

2.4 Criterion for Network Selection 

In general, for a given set of data, we know neither the optimal number $J^*$ of hidden units nor the optimal value $\lambda^*$ of the penalty factor in advance. We must thus consider a criterion to adequately evaluate the law-candidates discovered by changing both $J$ and $\lambda$.

In RF6, we adopt the procedure of cross-validation, frequently used for evaluating the generalization performance of learned networks. Here generalization means the performance on new data; thus, for this purpose we need test data, which should be independent of the training data. The procedure of cross-validation divides the data $D$ at random into $S$ distinct segments ($G_s$, $s = 1, \cdots, S$), uses $S - 1$ segments for training, and uses the remaining one for the test. This process is repeated $S$ times by changing the remaining segment, and the generalization performance is evaluated by using the following MSE (mean squared error) over all $S$ test results.

$$\mathrm{MSE}_{CV} = \frac{1}{N} \sum_{s=1}^{S} \sum_{\mu \in G_s} \left( y^\mu - y(q^\mu, x^\mu; \hat{\Theta}_s) \right)^2. \tag{13}$$

Here $G_s$ denotes the $s$-th segment for the test, and $\hat{\Theta}_s$ denotes the optimal weights obtained by using $D - G_s$ for the training. The extreme case of $S = N$ is known as the leave-one-out method, which is often used for a small data set. Since both poorly fitted and overly fitted networks show poor generalization performance, by using cross-validation we can select the optimal $J^*$ and $\lambda^*$.
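The cross-validation criterion of Eq. (13) can be sketched as follows. This is an illustrative outline only: `fit` and `predict` are hypothetical callables standing in for network training and evaluation, and the toy "model" used here simply predicts the mean target of its training part.

```python
import random

def cross_validation_mse(data, S, fit, predict, seed=0):
    """MSE_CV of Eq. (13): squared test error summed over S held-out segments, divided by N."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    segments = [shuffled[s::S] for s in range(S)]      # S disjoint segments G_s
    total = 0.0
    for s in range(S):
        train = [ex for t in range(S) if t != s for ex in segments[t]]
        model = fit(train)                             # weights learned on D - G_s
        total += sum((y - predict(model, inp)) ** 2 for inp, y in segments[s])
    return total / len(data)

# Toy stand-in for a learned network: predict the mean target of the training part.
def fit_mean(train):
    return sum(y for _, y in train) / len(train)

def predict_mean(model, inp):
    return model

data = [(i, 2.0) for i in range(10)]
mse_cv = cross_validation_mse(data, S=5, fit=fit_mean, predict=predict_mean)
loo = cross_validation_mse([(0, 0.0), (1, 2.0)], S=2, fit=fit_mean, predict=predict_mean)
```

In practice one would evaluate this quantity for each candidate pair (J, λ) and keep the pair with the smallest value; setting `S` equal to the number of examples gives the leave-one-out method.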

2.5 Restoring Nominally Conditioned Polynomials 

Assume that we select an adequate network as the best law-candidate by using the above criterion. Let $V = (v_{jkl})$ and $W = (w_j, w_{jk})$ be the final weights in the network. The weights $W$ are used to form polynomials, while $V$ are used to decompose the network into multiple nominally conditioned polynomials. The decomposing procedure is shown below, where we introduce a threshold parameter $\epsilon$ to absorb small deviations of weights and avoid the growth of useless rules.



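step 1: Shift constant weights to zero. For each hidden unit $j$, repeat the following: for each $k$ having much the same weight $v_{jkl}$, that is, $\max_l(v_{jkl}) - \min_l(v_{jkl}) < \epsilon$, shift the weights to zero (i.e., $v_{jkl} = 0$) after multiplying its contribution into $w_j$. Here such $v_{jk}$ is called constant. For all constant $v_{jk}$,

$$w'_j = w_j \exp\Bigl( \sum_{k:\, v_{jk}\ \text{constant}} \bar{v}_{jk} \Bigr), \tag{14}$$

where $\bar{v}_{jk}$ denotes the (nearly) common value of $v_{jkl}$ over $l$.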



step 2: Extract conditioned terms. For each $j$, repeat the following: by selecting one dummy variable from each non-constant $v_{jk}$, concatenate them. Clearly, all these concatenations exhaustively cover the space. For each concatenation $u$, we have a set $Q_{j:u}$ of dummy variables; for example, for non-constant $v_{j2}$ and $v_{j3}$, we have $Q_{j:1} = \{q_{21}, q_{31}\}$, $Q_{j:2} = \{q_{21}, q_{32}\}$, and so on. For each $u$, multiply its contribution into the weight $w'_j$ to get $w_{j:u}$,

$$w_{j:u} = w'_j \exp\Bigl( \sum_{q_{kl} \in Q_{j:u}} v_{jkl} \Bigr). \tag{15}$$

Then, for each $j$ we have a set of nominally conditioned terms, each of which has the following form. If $|w_{j:u}|$ is small enough (i.e., $|w_{j:u}| < \epsilon$), then set $t_{j:u} = 0$.

$$\text{if } \bigwedge_{q_{kl} \in Q_{j:u}} q_{kl} = 1 \text{ then } t_{j:u} = w_{j:u} \exp\Bigl( \sum_{k=1}^{K_2} w_{jk} \ln x_k \Bigr). \tag{16}$$

step 3: Combine conditioned terms. Combine the conditioned terms generated for hidden units $j$ and $j'$. The combinations are generated exhaustively, and for each combination the if-parts are conjunctively connected and the terms are simply added. Each combination has the following form.

$$\text{if } \Bigl( \bigwedge_{q_{kl} \in Q_{j:u}} q_{kl} \Bigr) \wedge \Bigl( \bigwedge_{q_{k'l'} \in Q_{j':u'}} q_{k'l'} \Bigr) = 1 \text{ then } t_{j:u,\,j':u'} = t_{j:u} + t_{j':u'}. \tag{17}$$



When the truth value of a combined if-part is always false, discard the cor- 
responding combination. Repeat such combination one by one until the final 
hidden unit is combined. Finally, by adding the common constant term wq, we 
can restore a set of rules as defined in Eq. ( 0 . 
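Steps 1 and 2 of the decomposing procedure can be sketched for a single hidden unit as below. This is an illustrative outline, not the authors' code: the function name `decompose_unit` and the data layout are hypothetical, and using the average of a constant variable's weights as its contribution is an assumption. The weights are the rounded values of hidden unit j=1 reported in Section 3.1.

```python
import math
from itertools import product

def decompose_unit(w_j, v_j, eps):
    """Steps 1 and 2 for one hidden unit.

    v_j[k][l] is the weight from dummy variable q_kl to this unit.
    Returns (w_prime, rules), where rules maps a tuple of (k, l) pairs
    (one category per non-constant nominal variable) to w_{j:u}.
    """
    w_prime = w_j
    varying = []
    for k, vk in enumerate(v_j):
        if max(vk) - min(vk) < eps:
            # Constant variable: absorb its (average) weight into w_j (assumption).
            w_prime *= math.exp(sum(vk) / len(vk))
        else:
            varying.append(k)
    rules = {}
    for cats in product(*(range(len(v_j[k])) for k in varying)):
        cond = tuple(zip(varying, cats))
        w_ju = w_prime * math.exp(sum(v_j[k][l] for k, l in cond))
        rules[cond] = 0.0 if abs(w_ju) < eps else w_ju   # drop negligible terms
    return w_prime, rules

# Rounded weights of hidden unit j=1 from the artificial-data example (Section 3.1).
v_1 = [[0.06, 0.06], [-0.03, 0.07, 0.07], [0.03, 0.03, 0.03, 0.03]]
w_prime, rules = decompose_unit(9.16, v_1, eps=0.05)
```

Because the inputs here are rounded to two decimals, the results agree with the values quoted in Section 3.1 only up to rounding. Step 3 would then combine the `rules` dictionaries of different hidden units by intersecting their conditions and adding their terms.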



3 Evaluation by Experiments 

By using two data sets, we evaluated the performance of the RF6 method. 

3.1 Artificial Data Set

We consider an artificial law described by 



I 




(18) 



where we have 3 nominal and 9 numeric explanatory variables, and the numbers of categories of $q_1, q_2, q_3$ are set as $L_1 = 2$, $L_2 = 3$ and $L_3 = 4$, respectively. Clearly, variables $q_1, q_3, x_6, \cdots, x_9$ are irrelevant to Eq. (18). Each example is generated as follows: each value of the nominal variables $q_1, q_2, q_3$ is randomly generated, each value of the numeric variables $x_1, \cdots, x_9$ is randomly generated in the range $(0, 1]$, and we get the corresponding value of $y$ by calculating Eq. (18) and adding Gaussian noise with a mean of 0 and a standard deviation of 0.1. The number of examples is set to 400 ($N = 400$).

In the experiments, the initial values for the weights $v_{jkl}$ and $w_{jk}$ were independently generated according to a normal distribution with a mean of 0 and a standard deviation of 1; the initial values for the weights $w_j$ were set to 0, but the bias value $w_0$ was initially set to the average output value of all training examples. The iteration was terminated when the gradient vector was sufficiently
small, i.e., ||V£(0)|p/M < 10“®, or the total processing time exceeded 100 sec- 
onds. The penalty factor A was changed from 10x10° to 10x10“® by multiplying 
by 10“^, and the number of hidden units was changed from 1 to 4 ( J=l,2,3,4). 
For each $J$ and $\lambda$, trials were repeated 10 times with different random numbers.

Figure 1 shows the performance of RF6, where the best RMSE (root mean squared error) was used for evaluation. The best RMSE for the training data was minimized when $J = 4$, while for a set of 10,000 test examples generated independently of the training examples, the best RMSE was minimized when $J = 3$; this indicates that the optimal number of hidden units seems to be $J^* = 3$.

For J=3, an example of the laws discovered by RF6 was as follows: 



I Q ^ 0g(O.O6gii-t-O.O6gi 2 — 0.03<J21 -t-0.07(j22 -1-0.07(723 +0.03(731-1-0. 03(^32 -1-0.03933-1-0. 03^34 ) 



where the weight values were rounded off to the second decimal place. Below we show how the decomposing procedure described in Section 2.5 works for this problem. Here $\epsilon$ is set to 0.05. In step 1, for $j = 1$, $v_{11}$ and $v_{13}$ are constant, and we have $w'_1 = 9.16 \exp(0.06 + 0.03) = 10.02$; for $j = 2$, $v_{21}$ and $v_{23}$ are constant, and $w'_2 = 0.046 \exp(0.00) = 0.046$; for $j = 3$, $v_{31}$ and $v_{33}$ are constant, and $w'_3 = 0.90 \exp(0.00) = 0.90$. In step 2, for $j = 1$, we have the following set:



y = -7.n 




(19) 



if $q_{21} = 1$ then $w_{1:1} = w'_1 \exp(-0.03) = 9.72$, $t_{1:1} = 9.72$

if $q_{22} = 1$ then $w_{1:2} = w'_1 \exp(0.07) = 10.75$, $t_{1:2} = 10.75$

if $q_{23} = 1$ then $w_{1:3} = w'_1 \exp(0.07) = 10.75$, $t_{1:3} = 10.75$.

Fig. 1. Performance of RF6 for artificial data set: (a) RMSE for training data, (b) RMSE for test data (vertical axis: best RMSE)



Similarly, for $j = 2$ and $j = 3$, we have the following:

if $q_{21} = 1$ then $w_{2:1} = w'_2 \exp(4.17) = 2.98$, $t_{2:1} = 2.98 \prod_k x_k^{w_{2k}}$

if $q_{22} = 1$ then $w_{2:2} = w'_2 \exp(-1.73) = 0.008$, $t_{2:2} = 0$

if $q_{23} = 1$ then $w_{2:3} = w'_2 \exp(-2.44) = 0.004$, $t_{2:3} = 0$

if $q_{21} = 1$ then $w_{3:1} = w'_3 \exp(-3.00) = 0.045$, $t_{3:1} = 0$

if $q_{22} = 1$ then $w_{3:2} = w'_3 \exp(1.50) = 4.03$, $t_{3:2} = 4.03 \prod_k x_k^{w_{3k}}$

if $q_{23} = 1$ then $w_{3:3} = w'_3 \exp(1.49) = 3.99$, $t_{3:3} = 3.99 \prod_k x_k^{w_{3k}}$



In step 3, by simply combining the corresponding terms and adding the common constant $w_0 = -7.71$, we can restore the following law

$$\begin{cases} \text{if } q_{21} = 1 \text{ then } y = 2.01 + 2.98 \prod_k x_k^{w_{2k}} \\ \text{if } q_{22} = 1 \text{ then } y = 3.04 + 4.03 \prod_k x_k^{w_{3k}} \\ \text{if } q_{23} = 1 \text{ then } y = 3.04 + 3.99 \prod_k x_k^{w_{3k}} \end{cases} \tag{20}$$

Although some weight values were slightly different, a law almost equivalent to the original one was found. This shows that RF6 is robust and noise tolerant to some degree. Note that without preparing some appropriate prototype functions, existing law discovery methods cannot find laws of this kind. Moreover, a set of multiple nominally conditioned polynomials was simultaneously discovered. These are the important advantages of RF6 over existing law discovery methods.

3.2 Automobile Data Set

The Automobile data set¹ contains specifications of cars and trucks in 1985, and was used to predict a price based on these specifications. The data set consists of 11 nominal and 14 numeric explanatory variables and one target variable (price). The data set has 159 examples with no missing values ($N = 159$).

Previous research applied linear regression and a nearest neighbor-based method called IBL using only the numeric data, and evaluated their performances in terms of MDE (mean deviation error) $= \frac{1}{N} \sum_{\mu=1}^{N} |\hat{y}^\mu - y^\mu| / y^\mu$, where $\hat{y}$ denotes the estimate of $y$. The MDE was 11.8% for IBL and 14.2% for linear regression, where all the examples were used for both training and test.

In the experiments with RF6, the initialization, terminating conditions, and the value range of $\lambda$ were exactly the same as in the former experiments. The number of hidden units was changed from 1 to 3 ($J = 1, 2, 3$), and for each $J$ and $\lambda$, trials were repeated 5 times with different random numbers. Before the analysis, the variables were normalized as follows: $\tilde{y} = (y - \mathrm{mean}(y)) / \mathrm{std}(y) = (y - 11445.73) / 5877.86$, and $\tilde{x}_k = x_k / \max(x_k)$, $k = 1, \cdots, K_2$. These normalizations guarantee the consistency of discovered laws (see the Appendix).

We applied RF6 to three cases: 1) data of only numeric variables, 2) data of numeric variables plus one nominal variable 'car-maker', and 3) data of all variables. For comparison we applied ridge regression, that is, linear regression with the squared penalty. The nominal variable 'car-maker' was selected for the second case because it had the widest value range when ridge regression was applied to the third case. The best generalization performance of the third case ($\mathrm{MSE}_{CV} = 0.0725$ for $J = 1$) was inferior to that of the second case ($\mathrm{MSE}_{CV} = 0.0676$ for $J = 1$) because of over-fitting to the training data. Thus, we omit the third case here. We measured the performance by using the MSE, MDE, $\mathrm{MSE}_{CV}$ and $\mathrm{MDE}_{CV}$. Table 1 shows the best performance of ridge regression and RF6 for the two cases, and Figure 2 compares the best root MSE and root $\mathrm{MSE}_{CV}$ for the second case.



Table 1. Best performance comparison for automobile data set

              only numeric vars                numeric vars + car-maker
method        MSE     MDE    MSEcv   MDEcv     MSE     MDE    MSEcv   MDEcv
ridge reg.    0.1484  14.13  0.2044  15.83     0.0474  8.33   0.0969  10.18
RF6, J=1      0.0962  11.27  0.1669  12.76     0.0325  8.00   0.0676  10.09
RF6, J=2      0.0423   8.52  0.0986  10.99     0.0141  5.48   0.0812   9.95
RF6, J=3      0.0252   6.94  0.0977  10.67     0.0040  3.02   0.0885   9.74



¹ We obtained the data from the UCI repository of machine learning databases.

Fig. 2. Performance comparison for automobile data set (numeric + car-maker): (a) RMSE for training data, (b) RMSE for test data (vertical axis: best RMSE)

These results tell us the following. Both ridge regression and RF6 greatly improved their performance with the addition of the nominal variable 'car-maker'. For



both cases of data, RF6 outperformed ridge regression on every measure. Among the models, RF6 with one hidden unit ($J = 1$) for the numeric and 'car-maker' variables stably showed the best generalization performance; the true best performance
was MSEcy = 0.0676 for A = 10“"^. With the increase of J, RF6 rapidly im- 
proved the MSE and MDE for the training data, but this does not always mean 
the improvement of the generalization performance. 

For J=1 with numeric and ’car-maker’ variables, RF6 discovered the law: 



$$\begin{aligned} y = -1.17 + 2.61 \exp(&0.67q_1 + 1.05q_2 - 0.12q_3 - 0.08q_4 + 0.28q_5 - 0.03q_6 \\ &+ 0.50q_7 + 0.31q_8 - 0.16q_9 - 0.00q_{10} + 0.19q_{11} - 0.08q_{12} + 0.49q_{13} \\ &+ 0.62q_{14} + 0.05q_{15} + 0.11q_{16} + 0.39q_{17} + 0.36q_{18}) \times \prod_{k=1}^{14} x_k^{w_{1k}}, \end{aligned} \tag{21}$$



where $q_k$ denotes the $k$-th category of 'car-maker', and the weight values are rounded off to the second decimal place. Note that this formula can be easily decomposed into a set of rules. Since the second term on the right-hand side of Eq. (21) is always positive, a weight for $q_k$ indicates the price-setting tendency of the $k$-th car-maker for similar specifications. A large positive weight such as 1.05 (bmw) indicates a high setting, while a negative weight, for example −0.16 (mitsubishi), indicates a low setting. Moreover, car-makers having similar weights had similar price settings in 1985; for example, 0.67 (audi) ≈ 0.62 (saab), 0.50 (mazda) ≈ 0.49 (porsche), 0.39 (volkswagen) ≈ 0.36 (volvo), 0.31 (mercedes-benz) ≈ 0.28 (honda), 0.05 (subaru) ≈ 0.00 (nissan), −0.08 (dodge) ≈ −0.08 (plymouth), and −0.12 (chevrolet) ≈ −0.16 (mitsubishi).

4 Conclusion

To discover a law in the form of multiple nominally conditioned polynomials, we 
have proposed a new connectionist method called RF6. RF6 can simultaneously 
discover both nominal conditions and corresponding polynomials via learning 
of single neural networks. Experiments showed that RF6 successfully discovered 
a set of nominally conditioned polynomials whose powers are not restricted to 
integers, even if the data contained irrelevant variables and small noise. 

However, RF6 has some difficulty if subspaces are properly divided only by 
using numeric variables, or a numeric law has a term beyond polynomials, such 
as l/(a;i + X 2 ) or (a;i + Moreover, the current RF6 requires heavy load 

in selecting an optimal subset of nominal variables when the inclusion of all 
nominal variables degrades the generalization performance. 

Appendix: Consistency of Polynomial with Penalty Term

Consider the following polynomial as a regression function, where $x_k$ is a numeric or dummy variable. If $x_k$ is a dummy variable, $x_k = e$ for 1, and $x_k = 1$ for 0.

$$y(x; \Theta) = w_0 + \sum_j w_j \prod_k x_k^{w_{jk}} = w_0 + \sum_j w_j \exp\Bigl( \sum_k w_{jk} \ln x_k \Bigr),$$

where $\Theta = (w_j, w_{jk})$. The generalization performance of $y(x; \Theta)$ is measured by

$$G_{err}(y) = E_D\bigl[ E_T\bigl[ (y - y(x; \Theta^*(D)))^2 \bigr] \bigr],$$

where $D$ and $T$ denote training data and test data respectively. Moreover, $E_D$ and $E_T$ denote expectation over $D$ and $T$ respectively, and $\Theta^*(D)$ is the optimal weights dependent on $D$. Since the squared penalty is confined to the weights $w_{jk}$, the penalized target function is given as follows:

$$\mathcal{E} = \frac{1}{2} \sum_\mu \bigl( y^\mu - y(x^\mu; \Theta) \bigr)^2 + \frac{\lambda}{2} \sum_j \sum_k w_{jk}^2.$$

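We consider variable normalization: $\tilde{y} = cy + d$, $\tilde{x}_k = a_k x_k$, where $c \neq 0$ and $a_k > 0$. The generalization performance after the normalization is measured by

$$G_{err}(\tilde{y}) = E_{\tilde{D}}\bigl[ E_{\tilde{T}}\bigl[ (\tilde{y} - \tilde{y}(\tilde{x}; \tilde{\Theta}^*(\tilde{D})))^2 \bigr] \bigr].$$

We call $y(x; \Theta)$ consistent if $G_{err}(\tilde{y}) = c^2 G_{err}(y)$. The optimal weights before the normalization, $\Theta^*$, satisfy the following. Here $e_j^\mu = \exp(\sum_k w_{jk}^* \ln x_k^\mu)$.

$$\begin{cases} \partial \mathcal{E} / \partial w_0 = -\sum_\mu \bigl( y^\mu - w_0^* - \sum_{j'} w_{j'}^* e_{j'}^\mu \bigr) = 0, \\ \partial \mathcal{E} / \partial w_j = -\sum_\mu \bigl( y^\mu - w_0^* - \sum_{j'} w_{j'}^* e_{j'}^\mu \bigr) e_j^\mu = 0, \\ \partial \mathcal{E} / \partial w_{jk} = -\sum_\mu \bigl( y^\mu - w_0^* - \sum_{j'} w_{j'}^* e_{j'}^\mu \bigr) w_j^* e_j^\mu \ln x_k^\mu + \lambda w_{jk}^* = 0. \end{cases}$$

The optimal weights after the normalization, $\tilde{\Theta}^*$, satisfy the following. Here $\tilde{e}_j^\mu = \exp(\sum_k \tilde{w}_{jk}^* \ln \tilde{x}_k^\mu)$.

$$\begin{cases} \partial \tilde{\mathcal{E}} / \partial \tilde{w}_0 = -\sum_\mu \bigl( \tilde{y}^\mu - \tilde{w}_0^* - \sum_{j'} \tilde{w}_{j'}^* \tilde{e}_{j'}^\mu \bigr) = 0, \\ \partial \tilde{\mathcal{E}} / \partial \tilde{w}_j = -\sum_\mu \bigl( \tilde{y}^\mu - \tilde{w}_0^* - \sum_{j'} \tilde{w}_{j'}^* \tilde{e}_{j'}^\mu \bigr) \tilde{e}_j^\mu = 0, \\ \partial \tilde{\mathcal{E}} / \partial \tilde{w}_{jk} = -\sum_\mu \bigl( \tilde{y}^\mu - \tilde{w}_0^* - \sum_{j'} \tilde{w}_{j'}^* \tilde{e}_{j'}^\mu \bigr) \tilde{w}_j^* \tilde{e}_j^\mu \ln \tilde{x}_k^\mu + \tilde{\lambda} \tilde{w}_{jk}^* = 0. \end{cases}$$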

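We see that the following satisfy the above two sets. Here $a_j^* = \exp(\sum_k w_{jk}^* \ln a_k)$.

$$\tilde{w}_0^* = c\, w_0^* + d, \quad \tilde{w}_j^* = \frac{c}{a_j^*}\, w_j^*, \quad \tilde{w}_{jk}^* = w_{jk}^*, \quad \tilde{\lambda} = c^2 \lambda.$$

It is clear that the above guarantees $G_{err}(\tilde{y}) = c^2 G_{err}(y)$.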
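The weight correspondence above can be checked numerically. The sketch below uses arbitrary illustrative weights and scaling constants (none of them come from the paper) and verifies that the model with transformed weights, evaluated on the scaled inputs, equals $c\,y + d$.

```python
import math

def poly(x, w0, w, wx):
    """y(x; Theta) = w0 + sum_j w[j] * prod_k x[k] ** wx[j][k]."""
    return w0 + sum(w[j] * math.prod(xk ** p for xk, p in zip(x, wx[j]))
                    for j in range(len(w)))

# Arbitrary illustrative weights and scaling constants (not from the paper).
w0, w, wx = 0.5, [2.0, -1.5], [[1.0, -0.5], [0.3, 2.0]]
c, d, a = 3.0, -1.0, [2.0, 0.25]

# Transformed weights: w0~ = c*w0 + d, wj~ = c*wj / a_j*, wjk~ = wjk,
# with a_j* = exp(sum_k wjk * ln a_k) = prod_k a_k ** wjk.
a_star = [math.prod(ak ** p for ak, p in zip(a, wx[j])) for j in range(len(w))]
w0_t = c * w0 + d
w_t = [c * w[j] / a_star[j] for j in range(len(w))]

x = [1.7, 0.9]
x_t = [ak * xk for ak, xk in zip(a, x)]
lhs = poly(x_t, w0_t, w_t, wx)      # transformed model on scaled inputs
rhs = c * poly(x, w0, w, wx) + d    # scaled output of the original model
```

Since every residual is multiplied by $c$ under this correspondence, the squared error scales by $c^2$, which matches the choice $\tilde{\lambda} = c^2 \lambda$ for the penalty factor.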

Abstract. We propose a projection mapping H-Map to reduce the dimensionality of multi-dimensional data, which can be applied to any metric space such as an L1 or L∞ metric space, as well as Euclidean space. We investigate properties of H-Map and show its usefulness for spatial indexing, by comparison with the traditional Karhunen-Loève (K-L) transformation, which can be applied only to Euclidean space. Unlike K-L transformation, H-Map does not require coordinates of data. H-Map has an advantage in using spatial indexing such as R-tree because it is a continuous mapping from a metric space to an L∞ metric space, where a hyper-sphere is a hyper-cube in the usual sense.



1 Introduction 

Mapping multi-dimensional data into a space of lower dimension is one of the most important subjects in multimedia databases, as well as in pattern recognition. In statistical pattern recognition, extracting a feature defined by a combination of data attributes plays a central role, and one of the central issues is to find a feature that extracts as much information as possible. In information retrieval of multi-dimensional data, realizing efficient spatial indexing requires projecting data into a space of lower dimension. To take full advantage of an indexing structure such as R-tree [3], it is important to keep as much information as possible in the projected data.

Karhunen-Loève (K-L for short) transformation has been used as a dimension reduction mapping in various problems. It is essentially the same as principal component analysis in multivariate analysis. In K-L transformation, data is assumed to be in a Euclidean metric space with coordinate values, and a feature is obtained as a linear combination of coordinate values, which can be considered as an orthogonal projection in the Euclidean space.

However, in applications such as multimedia databases, Euclidean distance is not always assumed. In the present paper, we propose a dimension reduction method, named H-Map. H-Map can be applied to an arbitrary metric space. It requires only distances between data and no coordinate values. We compare H-Map with K-L transformation for several sample data sets and show its effectiveness as a dimension reduction mapping.



2 Metric Space 

A distance function $D : X \times X \to \mathbb{R}$ of a metric space $X$ satisfies the following axioms of distance for any $x, y, z$ in $X$:

(1) $D(x, y) \geq 0$ ($D(x, y) = 0$ if and only if $x = y$),

(2) $D(x, y) = D(y, x)$, and

(3) $D(x, z) \leq D(x, y) + D(y, z)$.

Condition (3) is also referred to as the triangle inequality.

We assume that any point $x$ in a space is represented by an $n$-tuple $(x_{(1)}, x_{(2)}, \ldots, x_{(n)})$ of real numbers. Then, the following three functions satisfy the axioms of distance:

L2 (Euclidean) metric: $D(x, y) = \sqrt{\sum_{i=1}^{n} (x_{(i)} - y_{(i)})^2}$

L1 metric: $D(x, y) = \sum_{i=1}^{n} |x_{(i)} - y_{(i)}|$

L∞ metric: $D(x, y) = \max_{1 \leq i \leq n} |x_{(i)} - y_{(i)}|$

The set of points that are the same distance from some point is called an equidistant surface. In a 2-dimensional space, the equidistant surfaces in the Euclidean, L1, and L∞ metric spaces are a circle, a diamond with diagonals parallel to the axes, and a square parallel to the axes, respectively, as shown in Fig. 1.
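The three distance functions can be written down directly. The following is a small sketch in Python; the function names `l1`, `l2`, and `linf` are chosen for this example only.

```python
import math

def l1(x, y):
    """L1 metric: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def l2(x, y):
    """L2 (Euclidean) metric."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def linf(x, y):
    """L-infinity metric: largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))

p, q = (0.0, 0.0), (3.0, 4.0)
distances = (l1(p, q), l2(p, q), linf(p, q))
```

For the points (0, 0) and (3, 4) the three distances are 7, 5, and 4, which illustrates the general ordering L∞ ≤ L2 ≤ L1.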



3 Dimension Reduction Mappings: H-Map and K-L 
Transformation 

3.1 H-Map 

For any $a$ and $b$ in a space $X$ with distance function $D$, we define a mapping $\varphi_{ab}$ as follows:

$$\varphi_{ab}(x) = \frac{D(x, a) - D(x, b)}{2}.$$

The pair $(a, b)$ is called a pivot. For an $m$-tuple of pivots $\Pi = ((a_1, b_1), \ldots, (a_m, b_m))$, we define a mapping $\Phi_\Pi : X \to \mathbb{R}^m$ from $X$ to the $m$-dimensional space $\mathbb{R}^m$ by

$$\Phi_\Pi(x) = (\varphi_{a_1 b_1}(x), \varphi_{a_2 b_2}(x), \ldots, \varphi_{a_m b_m}(x)).$$

We call a mapping $\Phi_\Pi : X \to \mathbb{R}^m$ an H-Map. We can show the following lemma.

Fig. 1. Equidistant surface

Lemma 1. For any $a, b, x, y$ in a metric space $X$, $|\varphi_{ab}(x) - \varphi_{ab}(y)| \leq D(x, y)$.

Proof. Let $D(x, y) = n$. From the triangle inequality, $|D(x, a) - D(y, a)| \leq D(x, y) = n$ and $|D(x, b) - D(y, b)| \leq D(x, y) = n$. Therefore,

$$\begin{aligned} |\varphi_{ab}(x) - \varphi_{ab}(y)| &= \left| \frac{D(x, a) - D(x, b)}{2} - \frac{D(y, a) - D(y, b)}{2} \right| \\ &= \frac{|D(x, a) - D(x, b) - (D(y, a) - D(y, b))|}{2} \\ &\leq \frac{|D(x, a) - D(y, a)| + |D(x, b) - D(y, b)|}{2} \leq n. \qquad \Box \end{aligned}$$

Theorem 1. Let $D'$ be the L∞ distance function in $\mathbb{R}^m$. Then, for any tuple of pivots $\Pi$,

$$D'(\Phi_\Pi(x), \Phi_\Pi(y)) \leq D(x, y).$$

The above theorem says that any H-Map is a continuous mapping from a metric space to an L∞ metric space. Furthermore, H-Map has the following properties:

(1) H-Map can be applied to any metric space such as an L1 or L∞ metric space, as well as a Euclidean metric space. Here, we should note that Lemma 1 and Theorem 1 depend only on the triangle inequality.

(2) H-Map can be defined only from distances without coordinate values.
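The definition of H-Map and the bound of Theorem 1 can be illustrated as follows. This is a minimal sketch: the random points, the arbitrary choice of pivots, and the function names are assumptions made for this example, and the check is an empirical illustration rather than a proof.

```python
import random

def h_map(x, pivots, dist):
    """Phi_Pi(x): one coordinate (D(x, a) - D(x, b)) / 2 per pivot pair (a, b)."""
    return [(dist(x, a) - dist(x, b)) / 2 for a, b in pivots]

def l1(x, y):
    return sum(abs(s - t) for s, t in zip(x, y))

def linf(x, y):
    return max(abs(s - t) for s, t in zip(x, y))

rng = random.Random(0)
points = [tuple(rng.uniform(-1, 1) for _ in range(5)) for _ in range(50)]
pivots = [(points[0], points[1]), (points[2], points[3])]   # arbitrary pivot choice

# Theorem 1: the L-infinity distance between images never exceeds the original distance.
contractive = all(
    linf(h_map(x, pivots, l1), h_map(y, pivots, l1)) <= l1(x, y) + 1e-12
    for x in points for y in points
)
```

Note that `h_map` only ever calls `dist`, so it never needs the coordinates of the data, which is property (2) above.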

3.2 K-L Transformation 

A dimension reduction mapping can be considered as a feature extraction function in pattern recognition. By using K-L transformation, we can obtain an optimal linear mapping from $\mathbb{R}^n$ to $\mathbb{R}^m$ which minimizes the mean squared error after mapping. The method is essentially the same as the method used in principal component analysis. A mapping obtained from K-L transformation is an orthogonal projection in Euclidean space. Usually, K-L transformation assumes that the metric is Euclidean. From the viewpoint of the computation algorithm, it requires coordinate values of data. Fig. 2 shows an example of a 2-dimensional data set and its projection axes $(x', y')$ obtained by K-L transformation, where the orthogonal projection onto the principal axis $x'$, which minimizes the mean squared error, gives the optimal linear transformation from 2 dimensions to one dimension $\mathbb{R}$.

Fig. 2. Orthogonal projection by K-L transformation
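For 2-dimensional data the principal axis can be computed in closed form from the covariance entries. The sketch below is an illustration under that restriction (general dimensions would need an eigen-decomposition); the data points and function names are hypothetical.

```python
import math

def kl_principal_axis_2d(points):
    """Return (mean, unit direction) of the first principal axis of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)   # angle of the largest-variance axis
    return (mx, my), (math.cos(theta), math.sin(theta))

def project(points, mean, axis):
    """Orthogonal projection onto the principal axis (coordinate along x')."""
    return [(p[0] - mean[0]) * axis[0] + (p[1] - mean[1]) * axis[1] for p in points]

# Hypothetical data lying along the diagonal y = x.
pts = [(-2.0, -2.0), (-1.0, -1.0), (0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
mean, axis = kl_principal_axis_2d(pts)
coords = project(pts, mean, axis)
variance = sum(c * c for c in coords) / len(coords)
```

Here the principal axis is the diagonal, and the projected variance equals the total variance of the data, since all points lie on that axis. Unlike H-Map, this computation needs the coordinates of every point.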



4 Comparison of H-Map with K-L Transformation 

Since a dimension reduction mapping that keeps the distance relations of data as much as possible scatters the mapped data over a wider range, such a mapping is required to achieve high efficiency in retrieval. In the present paper, we take the variance of mapped data as the criterion for a good mapping. The optimal linear mapping in this sense is obtained by K-L transformation. A larger variance gives a more accurate answer in retrieval. In what follows, we consider three data sets in 2-dimensional space as examples to compare H-Map with K-L transformation and show the usefulness of H-Map.

For simplicity, we assume the dimension of data before projection is 2. As for data sets, we consider the three distributions shown in Fig. 3, where • represents a data point and ⊙ represents two data points at the same position. The principal axis of K-L transformation, on which the orthogonal projection gives the largest variance, is the same as the original x-axis for data sets (1) and (2). For data set (3) it is drawn as $x'$ by a dotted line.

First, we assume the data sets are in a Euclidean space, where both H-Map and K-L transformation can be directly applied. For H-Map we take the two most distant points as the pivot of mapping, which gives a relatively good projection. Thus, the two end points on the principal axis of K-L transformation are taken as the pivot. For data set (1), the pair of $(-3, 0)$ and $(3, 0)$ is taken as the pivot.

Fig. 3. Data Sets

For non-Euclidean metric spaces such as L1 and L∞, K-L transformation cannot be applied. Let $C_f(r)$ be the set of points within distance $r$ from a point $p$, that is, $C_f(r) = \{x \mid D(x, p) \leq r\}$ using the $L_f$ metric, for each $f = 1, 2,$ or $\infty$. Then, the following relations are clear from Fig. 1:

$$C_1(r) \subseteq C_2(r), \quad \text{and} \quad C_\infty(r) \subseteq C_2(\sqrt{2}\, r).$$

Therefore, we can retrieve all the necessary data by using $C_2(r)$ instead of $C_1(r)$. Similarly, retrieval of $C_\infty(r)$ is approximated in Euclidean space, where the radius $r$ of retrieval should be enlarged by a factor $\sqrt{2}$ ($\sqrt{n}$ for $n$ dimensions in general).
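The two containment relations can be spot-checked on random samples, as in the following sketch. The sample points, the radius, and the centre are arbitrary choices for illustration, and the check is empirical rather than a proof.

```python
import math
import random

def l1(x, y):
    return sum(abs(s - t) for s, t in zip(x, y))

def l2(x, y):
    return math.sqrt(sum((s - t) ** 2 for s, t in zip(x, y)))

def linf(x, y):
    return max(abs(s - t) for s, t in zip(x, y))

rng = random.Random(1)
p, r, n = (0.0, 0.0), 1.0, 2
samples = [(rng.uniform(-2, 2), rng.uniform(-2, 2)) for _ in range(2000)]

# C_1(r) lies inside C_2(r), and C_inf(r) lies inside C_2(sqrt(n) * r).
c1_in_c2 = all(l2(x, p) <= r + 1e-12 for x in samples if l1(x, p) <= r)
cinf_in_c2 = all(l2(x, p) <= math.sqrt(n) * r + 1e-12
                 for x in samples if linf(x, p) <= r)
```

The enlarged Euclidean ball for the L∞ case covers more area than the original square query region, which is the source of the unnecessary matches discussed in the comparison.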

For each data set in Fig. 3, the results of projection by H-Map based on the three metrics L1, L2, and L∞, and the results by K-L transformation on L2, are summarized by variances in Table 1.

First, let us consider the cases where the metric is Euclidean. No matter whether the principal axis is parallel to the original one (data sets (1) and (2)) or not (data set (3)), all the variances are almost the same. More precisely, for data set (2), H-Map gives a larger variance than K-L transformation. Thus, even for a data set with Euclidean distance, there exists a case where H-Map gives a better projection than K-L transformation.

For the cases of the L1 or L∞ metric, we compare the results of direct application of H-Map with those of applying K-L transformation to Euclidean distances. When the principal axis of K-L transformation is parallel to the original one (data sets (1) and (2)), H-Map gives the same variance as K-L transformation. On the other hand, for data set (3), the variance of H-Map is twice or half that of K-L transformation in the L1 or L∞ case, respectively. Therefore, for data sets with the L1 metric, we may conclude that if the principal axis is parallel to the original axis there are no differences between H-Map and K-L transformation; otherwise H-Map gives the larger variance in projected data. Since the radius in approximating retrieval of L∞ data sets by Euclidean distance should be enlarged, H-Map gives a better projection than K-L transformation for L∞ data sets. The difference of H-Map from K-L transformation is larger when the principal axis is parallel to the original axis.

Table 1. Variances of projected data

Data set                    (1)                     (2)                     (3)
metric               L1     L2     L∞        L1     L2     L∞        L1     L2     L∞
H-Map                3.000  2.945  3.000     3.429  3.434  3.429     6.667  3.333  1.667
K-L transformation   -      3.000  -         -      3.429  -         -      3.333  -

5 Concluding Remarks 

In this paper, we proposed H-Map as a dimension reduction mapping for multi-dimensional data and showed its properties and effectiveness. While traditional methods like K-L transformation assume a Euclidean metric space, H-Map can be applied to any metric space. This is an advantage of H-Map over other methods. For example, H-Map is applicable to the set of character strings with edit distance. Edit distance is not Euclidean and there is no clear coordinate representation. The effectiveness of H-Map as a dimension reduction mapping has been shown by comparison with K-L transformation for several sample data sets. From the comparison we can observe that H-Map has a merit relative to K-L transformation even for data sets in a Euclidean metric space, as well as in an L1 or L∞ metric space.
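The remark about character strings can be made concrete with a short sketch. The word list and the pivot pair below are hypothetical, and the edit distance used is the standard unit-cost Levenshtein distance, which is an assumption about which variant is meant.

```python
def edit_distance(s, t):
    """Levenshtein distance (unit-cost insert, delete, substitute)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]

def h_map(x, pivots, dist):
    """One coordinate (D(x, a) - D(x, b)) / 2 per pivot pair (a, b)."""
    return [(dist(x, a) - dist(x, b)) / 2 for a, b in pivots]

words = ["kitten", "sitting", "mitten", "fitting", "bitten"]   # hypothetical data
pivots = [("kitten", "sitting")]                               # arbitrary pivot pair
images = {w: h_map(w, pivots, edit_distance) for w in words}

# The bound of Theorem 1 holds for every pair of strings.
bounded = all(
    abs(images[u][0] - images[v][0]) <= edit_distance(u, v)
    for u in words for v in words
)
```

Each string is mapped to a real number using only pairwise distances, so the strings can be placed in a spatial index even though they have no coordinates.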

As stated in Theorem 1, H-Map is a continuous mapping from a metric space to an L∞ metric space. This fact shows another advantage of H-Map over other mappings based on orthogonal projection in Euclidean space when R-tree [3] is used as a spatial indexing structure. A node in R-tree contains data in a hyper-box parallel to the coordinate axes of the index space. Thus, R-tree separates the index space into hyper-boxes. When the metric in the index space is Euclidean, the answer to a range query with radius $r$ is the set of data within a hyper-sphere. In matching a sphere with boxes, there may be many unnecessary matches with data outside of the query range. On the other hand, the images of H-Map are located in an L∞ metric space, where hyper-spheres are hyper-boxes in the usual sense.

Shinohara et al. [4] realized a method of embedding multi-dimensional data in a discrete L1 metric space into Euclidean space to be projected by FastMap [1]. Although the method H-Map proposed in the present paper can be considered as a simplification of the method in [4], the applicability of H-Map is much wider. In this paper, we investigated the usefulness of H-Map only by applying it to several sample data sets. Generalizing the observed properties of H-Map is one of the most important future subjects. Furthermore, as future subjects, we have to consider algorithms of pivot selection for H-Map and utilize H-Map to realize efficient spatial indexing for approximate retrieval in multimedia databases.

Abstract. This paper proposes Normal Form Transformation (NFT) as
a preprocessing of Support Vector Machines (SVMs). Object recognition
from images can be regarded as a fundamental technique in discovery sci- 
ence. Aspect-based recognition with SVMs is effective under constrained 
situations. However, object recognition from rotated, shifted, magnified 
or reduced images is a difficult task for simple SVMs. In order to circum- 
vent this problem, we propose NFT, which rotates an image based on 
low-luminance directed vector and shifts, magnifies or reduces the image 
based on the object’s maximum horizontal distance and maximum ver- 
tical distance. We have applied SVMs with NFT to a database of 7200 
images concerning 100 different objects. The recognition rates were over 
97% in these experiments except for cases of extreme reduction. These 
results clearly demonstrate the effectiveness of the proposed approach in 
aspect-based recognition. 



1 Introduction 

Extracting information from images is an important basis of discovery science. 
Object recognition, which represents a fundamental step in these tasks, is an 
estimation of the class of an image based on various information of the image. 

Model-based object recognition requires many images in order to produce 
an accurate model. On the other hand, aspect-based object recognition requires 
neither a precise model nor many images, and can be applied to various problems. 
In this paper, we deal with aspect-based object recognition. Among various 
approaches in aspect-based object recognition, an approach based on support 
vector machine (SVM)|^ has achieved higher accuracy than the conventional 
methods. 

However, object recognition based on SVM proposed by Pontil has difficul- 
ties in recognizing rotated, shifted, magnified or reduced images. These trans- 
formations into images correspond to conventional changes, such as a change of 
viewpoint and a movement of an object. 

We propose normal formation transformation (NFT) in order to recognize
rotated, shifted, magnified or reduced images. NFT can deal with a rotated
image by means of a low-luminance directed vector, which indicates a direction
from a position of high luminance to a position of low luminance. Moreover, it
can deal with a shifted, magnified or reduced image by using the object’s maximum
horizontal distance and maximum vertical distance.

We performed experiments on the recognition of 100 objects using 7200
images. The recognition rates of SVM based on NFT (SVM-NFT) were 97-98%
except for cases of extreme reduction.



2 Problem for Object Recognition 



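We prepare $m$ objects and observe $n$ images of $\iota \times \kappa$ pixels
from several directions with respect to each object. This image set $\Omega$
is given by

\[
\omega_{\tau\upsilon} =
\begin{pmatrix}
p_{11} & \cdots & p_{i1} & \cdots & p_{\iota 1} \\
\vdots &        & \vdots &        & \vdots \\
p_{1j} & \cdots & p_{ij} & \cdots & p_{\iota j} \\
\vdots &        & \vdots &        & \vdots \\
p_{1\kappa} & \cdots & p_{i\kappa} & \cdots & p_{\iota\kappa}
\end{pmatrix}
\in \Omega, \qquad
\tau = 1, 2, \ldots, m, \quad \upsilon = 1, 2, \ldots, n
\tag{1}
\]

where $\omega_{\tau\upsilon}$ is the $\upsilon$th image with respect to the
$\tau$th object and $p_{ij}$ is the luminance of the $(i, j)$ pixel. Fig. 1
shows an example of the images in the image set $\Omega$.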




Fig. 1. Example image of an object 



This paper considers a subset $\Omega'$ of $\Omega$ as training examples and
applies a classification algorithm to this object recognition problem. A
classification model $f$ that predicts the object of an unknown image
$\omega_{\tau\upsilon}$ ($\in \Omega$) is obtained from $\Omega'$. If a
classification model $f$ satisfies the following equation, we regard $f$ as
recognizing the object $\tau$ from the image $\omega_{\tau\upsilon}$.

\[ f(\omega_{\tau\upsilon}) = \tau \tag{2} \]



3 Support Vector Machines 

3.1 Description of the Problem 

Support vector machines (SVMs) [3, 5] are applied to a classification problem
with a binary class and try to find a hyperplane that divides a data set leaving
all the points of the same class on the same side. We give a brief explanation of
the method below.
We first consider the following data set.

\[
(\mathbf{x}_i, y_i), \quad \mathbf{x}_i \in \mathbb{R}^n, \quad
y_i \in \{-1, 1\}, \quad i = 1, 2, \ldots, l
\tag{3}
\]

where $\mathbf{x}_i$ is an $n$-dimensional vector and $l$ is the number of
examples. Each vector $\mathbf{x}_i$ belongs to one of the two classes and is
thus given a class $y_i$.

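A hyperplane in an $n$-dimensional space is given by

\[ \mathbf{w} \cdot \mathbf{x} + b = 0 \tag{4} \]

where $\mathbf{w}$ is an $n$-dimensional vector, $b$ is a coefficient and
$\mathbf{u} \cdot \mathbf{v}$ is the inner product of $\mathbf{u}$ and
$\mathbf{v}$. Suppose that, for this hyperplane, each training example
$\mathbf{x}_i$ satisfies

\[ y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 - \xi_i, \quad i = 1, \ldots, l \tag{5} \]

where $\xi_i$ represents a degree of misclassification ($\xi_i \ge 0$).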

The distance between the hyperplane and a vector $\mathbf{x}$ is given by
$|\mathbf{w} \cdot \mathbf{x} + b| / \|\mathbf{w}\|$, where $|\cdot|$ is an
unsigned magnitude and $\|\cdot\|$ is a norm. We define $d_+$ and $d_-$ with
respect to the training examples for which $\xi_i = 0$ as follows.

\[
d_+ = \min_{i:\, y_i = 1} \frac{|\mathbf{w} \cdot \mathbf{x}_i + b|}{\|\mathbf{w}\|},
\qquad
d_- = \min_{i:\, y_i = -1} \frac{|\mathbf{w} \cdot \mathbf{x}_i + b|}{\|\mathbf{w}\|}
\tag{6}
\]

Here, $d_+ + d_-$ is called a margin. The hyperplane that maximizes this margin is
called an optimal separating hyperplane (OSH). SVM tries to find the OSH as
the classification model.



3.2 How to Obtain the OSH 

The OSH is given by solving the following optimization problem for maximizing 
the margin. Moreover, vectors that determine the OSH, support vectors (SVs), 
are obtained in this process. 

We solve the optimization problem of minimizing $\Phi(\mathbf{w}, \xi)$ under
the constraints of equation (5).

\[
\Phi(\mathbf{w}, \xi) = \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{l} \xi_i
\tag{7}
\]

where $C$ is a constant given by the user based on noise. Here, the optimization
problem of equation (7) under the constraints of equation (5) can be regarded as
the primal problem of a quadratic programming problem. This primal problem is
usually solved by means of the classical method of Lagrange multipliers.
Therefore, considering its dual problem, we solve the following problem.
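\[
\begin{aligned}
\text{Maximize} \quad & -\frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l}
  \alpha_i \alpha_j y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j)
  + \sum_{i=1}^{l} \alpha_i \\
\text{subject to} \quad & 0 \le \alpha_i \le C, \quad i = 1, \ldots, l \\
& \sum_{i=1}^{l} \alpha_i y_i = 0
\end{aligned}
\tag{8}
\]

where $\alpha_i$ is a Lagrange multiplier. The OSH is given by the optimal
solutions $\alpha_i$ of equation (8).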

From the Kuhn-Tucker conditions we obtain

\[ \alpha_i \left[ y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 + \xi_i \right] = 0 \tag{9} \]

\[ (C - \alpha_i)\,\xi_i = 0 \tag{10} \]

From equation (9), in the case $\alpha_i = 0$, the following equation is not
necessarily satisfied.

\[ y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 + \xi_i = 0 \tag{11} \]

Thus, $\mathbf{x}_i$ for which $\alpha_i = 0$ has no influence on deciding the
OSH. From equation (10), in the case $\alpha_i = C$, $\xi_i$ is not necessarily
zero. Therefore, $\mathbf{x}_i$ for which $\alpha_i = C$ has two kinds of
situations. One is a situation in which the vector $\mathbf{x}_i$ crosses over
the OSH and is misclassified ($1 < \xi_i$); the other is a situation in which
$\mathbf{x}_i$ is closer to the OSH than a SV ($0 < \xi_i \le 1$). In this
paper, we regard $\mathbf{x}_i$ for which $\alpha_i = C$ ($0 < \xi_i \le 1$) as
a misclassified vector. Hence, a vector $\mathbf{x}_i$ for which
$0 < \alpha_i < C$ is called a support vector (SV), which decides the OSH.

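The OSH is given by

\[
\mathbf{w} = \sum_{i=1}^{l} \alpha_i y_i \mathbf{x}_i, \qquad
b = -\frac{1}{2}\, \mathbf{w} \cdot (\mathbf{x}_r + \mathbf{x}_s)
\tag{12}
\]

where $\mathbf{x}_r$ and $\mathbf{x}_s$ are SVs for each class ($y_r = 1$ and
$y_s = -1$ respectively).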
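As a rough illustration of equation (12), the following Python sketch recovers w and b from a given dual solution and classifies a point by the sign of w · x + b. The function names, the toy data and the hand-computed multipliers are assumptions made for this example; no quadratic programming solver is involved.

```python
def osh_from_dual(alphas, xs, ys, C):
    """Recover (w, b) of the OSH from a dual solution, following eq. (12)."""
    dim = len(xs[0])
    # w = sum_i alpha_i * y_i * x_i
    w = [sum(a * y * x[k] for a, x, y in zip(alphas, xs, ys)) for k in range(dim)]
    # Support vectors satisfy 0 < alpha_i < C; take one from each class.
    x_r = next(x for a, x, y in zip(alphas, xs, ys) if 0 < a < C and y == 1)
    x_s = next(x for a, x, y in zip(alphas, xs, ys) if 0 < a < C and y == -1)
    # b = -1/2 * w . (x_r + x_s)
    b = -0.5 * sum(wk * (r + s) for wk, r, s in zip(w, x_r, x_s))
    return w, b


def classify(w, b, x):
    """Return +1 or -1 according to the side of the hyperplane w.x + b = 0."""
    return 1 if sum(wk * xk for wk, xk in zip(w, x)) + b >= 0 else -1


# Toy data (hypothetical): the first two points are the support vectors.
xs = [(1.0, 1.0), (-1.0, -1.0), (3.0, 2.0)]
ys = [1, -1, 1]
alphas = [0.25, 0.25, 0.0]  # satisfies sum_i alpha_i * y_i = 0
w, b = osh_from_dual(alphas, xs, ys, C=10.0)
```

The third point has a zero multiplier and therefore does not enter w, which matches the remark that such vectors have no influence on the OSH.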



4 SVM-NFT 

4.1 Training Method 

SVM-NFT is applied to the object recognition problem of section 2 restricted to
two objects. SVM-NFT selects two objects from the $m$ objects and applies our
preprocessing, NFT, to the $2 \times n$ images of these objects. We consider
each pixel of an image after the preprocessing as an attribute. A classification
model for discriminating these objects is produced by applying SVM to a
$3 \times \iota \times \kappa$-dimensional vector set (for color images) or to
an $\iota \times \kappa$-dimensional vector set (for gray-level images). Since
SVMs deal with a binary classification problem, SVM-NFT produces ${}_m C_2$
classification models in order to recognize $m$ objects, where $C$ represents a
binomial coefficient.
NFT deals with rotated images by using a low-luminance directed vector,
which indicates a direction from a position of high luminance to a position of
low luminance in an image. Moreover, NFT can deal with shifted, magnified
or reduced images by using the maximum horizontal distance and the maximum
vertical distance, each of which is given by luminance information of the image.

NFT rotates an image based on the following procedure in order to recognize
a rotated image. This procedure is based on a low-luminance directed vector and
a prime-luminance directed vector. The low-luminance directed vector is a vector
that indicates a direction from a position of high luminance to a position of
low luminance. In order to unify the low-luminance directed vectors of all
images, we define the prime-luminance directed vector. A rotated image
can be recognized by rotating its low-luminance directed vector to the
prime-luminance directed vector.

The low-luminance directed vector is given by the following method. A pixel
can be regarded as a point distributed in a three-dimensional space where the
X-axis, Y-axis and Z-axis correspond to the horizontal axis, the vertical axis
and the luminance, respectively. Here, we consider pixels with positive
luminance. These points are given by

\[ (\zeta_i, \eta_i, \nu_i), \quad i = 1, 2, \ldots, I \tag{13} \]

where $I$ is the number of pixels in the space, $\zeta_i$ is an X coordinate,
$\eta_i$ is a Y coordinate and $\nu_i$ is the luminance value of the
$(\zeta_i, \eta_i)$ pixel. A least-squares method is applied to these points.
The hyperplane given by the least-squares method is represented by equation (14).
Fig. 2. Transformation of an image into space information

\[ ax + by + z + c = 0 \tag{14} \]

where

(15)

(16)
A normal vector of equation (14) is $(a, b, 1)$. The low-luminance directed
vector $(a, b)$ is given by projecting this normal vector onto the X-Y plane.
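The plane fit behind the low-luminance directed vector can be sketched as follows. This is an illustration under stated assumptions, not the authors' formulas: it fits z = -(ax + by + c) by ordinary least squares over the pixels with positive luminance and solves the 3 x 3 normal equations with Cramer's rule. The names `det3` and `low_luminance_vector` are hypothetical.

```python
def det3(m):
    """Determinant of a 3 x 3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def low_luminance_vector(image):
    """image[j][i] is the luminance at column i (X) and row j (Y).

    Fits z = p*x + q*y + r to pixels with positive luminance and returns
    (a, b) = (-p, -q), the X-Y projection of the normal (a, b, 1) of the
    plane a*x + b*y + z + c = 0.
    """
    pts = [(i, j, v) for j, row in enumerate(image)
           for i, v in enumerate(row) if v > 0]
    n = len(pts)
    sx = sum(x for x, _, _ in pts)
    sy = sum(y for _, y, _ in pts)
    sz = sum(z for _, _, z in pts)
    sxx = sum(x * x for x, _, _ in pts)
    syy = sum(y * y for _, y, _ in pts)
    sxy = sum(x * y for x, y, _ in pts)
    sxz = sum(x * z for x, _, z in pts)
    syz = sum(y * z for _, y, z in pts)
    normal = [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, n]]
    rhs = [sxz, syz, sz]
    d = det3(normal)
    if d == 0:
        raise ValueError("pixels are degenerate; the plane is not unique")
    sol = []
    for k in range(3):
        mk = [row[:] for row in normal]
        for r in range(3):
            mk[r][k] = rhs[r]
        sol.append(det3(mk) / d)
    p, q, _ = sol
    return -p, -q


# Luminance grows along X, so the vector should point toward negative X.
ramp = [[10 + 5 * i for i in range(3)] for _ in range(3)]
a, b = low_luminance_vector(ramp)
```

With this sign convention the vector (a, b) points in the direction in which the fitted luminance decreases.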

Next, for recognizing shifted, magnified or reduced images, we calculate the
maximum horizontal distance $H$ and the maximum vertical distance $V$, which
correspond to the maximum horizontal extent and the maximum vertical extent of
an object, respectively. The magnification $r$ is estimated with these
distances.



where $\min(\alpha_1, \alpha_2, \ldots)$ represents the minimum value with
respect to $\alpha_i$, $i = 1, 2, \ldots$. NFT magnifies or reduces an image at
magnification $r$. Moreover, NFT moves the center of the rectangle defined by
the maximum horizontal distance and the maximum vertical distance to the center
of a region of image size $\iota \times \kappa$.
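The shift and scale step can be sketched in Python as below. The formula for the magnification is not legible in this text, so the sketch assumes r = min(width / H, height / V), which makes the object's bounding rectangle fill the frame; that choice, the nearest-neighbour resampling and the names `bounding_box` and `nft_scale_shift` are all assumptions of this example.

```python
def bounding_box(image):
    """Return (x0, x1, y0, y1) of the pixels with positive luminance."""
    cols = [i for row in image for i, v in enumerate(row) if v > 0]
    rows = [j for j, row in enumerate(image) if any(v > 0 for v in row)]
    return min(cols), max(cols), min(rows), max(rows)


def nft_scale_shift(image):
    """Scale the object by r and move its bounding box to the frame centre."""
    height, width = len(image), len(image[0])
    x0, x1, y0, y1 = bounding_box(image)
    h_dist, v_dist = x1 - x0 + 1, y1 - y0 + 1  # H and V
    r = min(width / h_dist, height / v_dist)   # assumed form of the magnification
    box_cx, box_cy = (x0 + x1) / 2, (y0 + y1) / 2
    frame_cx, frame_cy = (width - 1) / 2, (height - 1) / 2
    out = [[0] * width for _ in range(height)]
    for j in range(height):
        for i in range(width):
            # Inverse mapping: which source pixel lands on output pixel (i, j)?
            si = round(box_cx + (i - frame_cx) / r)
            sj = round(box_cy + (j - frame_cy) / r)
            if x0 <= si <= x1 and y0 <= sj <= y1:
                out[j][i] = image[sj][si]
    return r, out


# A 2 x 2 object in the top-left corner of a 4 x 4 frame.
img = [[9, 9, 0, 0],
       [9, 9, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]]
r, out = nft_scale_shift(img)
```

In this toy case the object is magnified by a factor of 2 and ends up covering the whole frame, regardless of where it started.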

4.2 Testing Method 

SVM-NFT first applies NFT to test examples. SVM-NFT then predicts the 
object of a test example by applying the classification model to the example. 

Recognition is performed by the following tournament rule. In the
classification problem concerning $m$ objects, each object is regarded as a
player. In each match, SVM-NFT classifies a test example based on the OSH. If
the players are objects $i$ and $j$ ($> i$) in a certain match, SVM-NFT obtains
either $i$ or $j$ as the class of a test example based on the following equation.



where $\mathbf{w}_{(i,j)}$ and $b_{(i,j)}$ are the parameters of the OSH with
respect to objects $i$ and $j$. In the next match, the winner of the previous
match plays against object $j + 1$. Since there are $m$ players, there are
$m - 1$ matches. SVM-NFT outputs the winner of the final match as the class of
the test example.
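The tournament can be sketched in a few lines of Python. The decision equation is not legible in this text, so the sketch assumes that a non-negative value of w · x + b selects object i and a negative value selects object j; this convention, the `models` dictionary layout and the toy one-dimensional classifiers are assumptions of the example.

```python
def tournament(x, models, m):
    """Run m - 1 pairwise matches and return the index of the final winner.

    models[(i, j)] holds (w, b) of the OSH for objects i < j. A non-negative
    decision value is assumed to mean object i, a negative one object j.
    """
    winner = 0
    for challenger in range(1, m):
        # The current winner always has a smaller index than the challenger.
        w, b = models[(winner, challenger)]
        value = sum(wk * xk for wk, xk in zip(w, x)) + b
        if value < 0:
            winner = challenger
    return winner


# Three hypothetical objects whose single feature clusters near 0, 10 and 20.
models = {
    (0, 1): ([-1.0], 5.0),   # boundary at x = 5
    (0, 2): ([-1.0], 10.0),  # boundary at x = 10
    (1, 2): ([-1.0], 15.0),  # boundary at x = 15
}
results = [tournament([v], models, 3) for v in (1.0, 12.0, 19.0)]
```

Only m - 1 of the pairwise models are consulted for each test example, even though all of them are trained.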

5 Experimental Results 

5.1 Experimental Condition 

We estimate the performance of SVM-NFT by using the COIL (Columbia Object
Image Library) database as an image set.

We performed experiments on four methods: simple SVMs, SVMs based on NFT
(SVM-NFT), perceptrons based on NFT and back-propagation based on NFT.

The average value of a least-squares error of perceptrons was set to 10“^.
On the other hand, we set the parameters of back-propagation as follows: the
number of iterations until convergence (10^), the least-squares error (10“^),
and the number of neurons in the hidden layer (10 neurons). The tournament
method was applied in each method.



r = min(…)    (17)

(18)
5.2 Application to COIL

COIL consists of 7200 images of 100 objects (see Fig. 3). The COIL
images were produced by the following method. The objects were positioned in
the center of a turntable and observed from a fixed camera. The turntable was
rotated in steps of 5° for each object, and 72 images were observed. Examples
of images are shown in Fig. 4. The COIL images are 24-bit color images of
128 x 128 pixels, and noise exists in the background of the images.




Fig. 3. Examples of images in COIL database 




Fig. 4. Examples of images for a rotated object 



Since a color image has a large quantity of information, this recognition
problem is relatively easy. Therefore, each color image (R, G, B) was transformed
into a gray-level image E by the following equation.

\[ E = 0.31R + 0.59G + 0.10B \tag{19} \]

Moreover, the luminance of a pixel under 50 was set to zero in order to remove
noise in the background.

Next, since an image size of 128 x 128 pixels is large, the image size was
reduced to 32 x 32 pixels by averaging each 4 x 4 pixel square region before the
application of SVM-NFT. This reduction was intended to reduce calculation
time, and experiments showed that it has no significant effect on accuracy.
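The preprocessing described above (gray-level conversion by equation (19), background thresholding at 50, and 4 x 4 block averaging) can be sketched as follows. The function names and the nested-list image layout are assumptions of this example, and the order of the steps simply follows the order of the description.

```python
def to_gray(r, g, b):
    """Gray level E = 0.31 R + 0.59 G + 0.10 B, as in equation (19)."""
    return 0.31 * r + 0.59 * g + 0.10 * b


def preprocess(rgb, block=4, threshold=50):
    """rgb[j][i] is an (R, G, B) tuple; returns the reduced gray-level image."""
    gray = [[to_gray(*px) for px in row] for row in rgb]
    # Luminance under the threshold is treated as background noise.
    gray = [[v if v >= threshold else 0.0 for v in row] for row in gray]
    height, width = len(gray), len(gray[0])
    # Average each block x block square region.
    return [[sum(gray[j + dj][i + di]
                 for dj in range(block) for di in range(block)) / block ** 2
             for i in range(0, width, block)]
            for j in range(0, height, block)]


# An 8 x 8 test image: dark top-left quadrant, brighter elsewhere.
rgb = [[(40, 40, 40) if i < 4 and j < 4 else (100, 100, 100)
        for i in range(8)] for j in range(8)]
small = preprocess(rgb)
```

Applied to a 128 x 128 COIL image, the same function would return a 32 x 32 gray-level image.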

A training set used in each experiment consists of 36 images (one every 10°)
for each object, and a test set consists of the remaining 36 images (one every
10°).

5.3 Experimental Results and Discussion 

Fig. 5 shows a comparison of recognition rates for images, each of which is
shifted by 10 pixels, rotated randomly and changed in magnification. Fig. 6
shows examples of test images.




Fig. 5. The recognition rate for images each of which is shifted 10 pixels, rotated 
randomly and changed the magnification. 




Fig. 6. Examples of test images 



We first consider the recognition rates of simple SVMs. Since a pixel in a
test example is moved to a position different from its position in the training
examples, the recognition rates are less than 10%.

On the other hand, the recognition rate of SVM-NFT is 97-98% except for cases
of extreme reduction. However, in cases of reduction by a factor below 0.5, the
differences in the low-luminance directed vector, the maximum horizontal
distance and the maximum vertical distance are large. Table 1 shows a
comparison for test examples reduced by a factor of 0.5 or below.
From the table, we see that the average errors of degrees and of ratios
increase as the magnification decreases.



Table 1. Average errors of degrees and of ratios with respect to magnification.
The second column indicates degrees between a low-luminance directed
vector and a prime-luminance directed vector. The third column indicates ratios
of the maximum horizontal distance to the maximum vertical distance.

                   Average error
Magnification    Degree    Ratio
0.50             0.89      2.01
0.40             1.48      2.86
0.30             2.21      3.99
0.25             2.78      5.15
0.20             3.71      6.69
0.15             5.57      9.30
0.10             8.88      13.19


The recognition rates of perceptrons based on NFT are nearly 70%. Since a 
margin of each perceptron is smaller than in the SVM, the recognition rate of 
perceptrons is considered low. 

On the other hand, recognition rates for back-propagation are nearly 60%, 
which is lower than those of perceptrons. We consider that overfitting occurred 
in this case. 




Fig. 7. Comparison of recognition rates of SVM-NFT against magnification, where
each test image is rotated by 5°. (a) Recognition rates for images that have
never been a SV. (b) Recognition rates for images that have been a SV.
Now, we show that some images are more likely to be misclassified. As
described in section 4.1, SVM-NFT produces 99 classification models for an object.
Consider images, each of which has never been a SV, in a set of images rotated
by 5°. Recognition rates for these images are shown in Fig. 7(a). Since these
examples are relatively far from a classification model, they can be classified
correctly even if they are given small fluctuations. On the other hand, images
each of which has been a SV are closer to a classification model, and are thus
likely to be misclassified even with small fluctuations. Fig. 7(b) shows their
recognition rates. Therefore, we consider that these examples are more likely to
be misclassified by transformations.

In conclusion, the results in Fig. 5 indicate that SVM-NFT is an effective
method for these object recognition problems, except for cases of extreme
reduction.

6 Conclusion 

It is difficult for a simple SVM to predict an object from rotated, shifted,
magnified or reduced images. Therefore, we proposed NFT as a preprocessing step
for SVM. In a recognition problem of 100 objects with 7200 images, the
recognition rates of SVM-NFT were 97-98% except for cases of extreme reduction.
Moreover, we performed experiments on perceptrons and back-propagation neural
networks based on NFT. However, since the margin of SVM is larger than that of a
perceptron or of back-propagation, their recognition rates are lower than that
of SVM. Therefore, SVM based on NFT is effective in recognizing an object from
rotated, shifted, magnified or reduced images.
1 Introduction

We give a formal definition of discovery in terms of descriptional complexity, 
relative to a class of partial recursive functions. Formal definitions of discovery 
will hopefully give possibilities for systematic analyses of the discovery process. 

Discovery can be categorized into the discovery of singular or universal
statements (expressions taken from Popper). Singular statements are scientific
observations made under a certain experimental condition. Universal statements
are regularities, or laws, observed under various experimental conditions. In
this paper, we focus on the discovery of the latter type. Science may be
regarded as the art of data compression. We should consider a good discovery to
considerably compress information. For example, the discovery of Newton’s laws
of physics has enabled us to compress a massive amount of observational data
into a simple equation. This idea is the central part of our definition.



2 Definitions and Propositions 

The descriptional (algorithmic) complexity of a binary string $x$ with respect
to a partial recursive function $\phi : \Sigma^* \to \Sigma^*$ is defined as
$C_\phi(x) = \min\{|p| : \phi(p) = x\}$, where $|p|$ is the length of string $p$.

Theorem 1 (Invariance Theorem). There exists a universal partial recursive
function $\phi_0$ such that for any partial recursive function $\phi$, there
exists a constant $c_\phi$ such that for all $x$,
$C_{\phi_0}(x) \le C_\phi(x) + c_\phi$.

Fix a universal partial recursive function $\phi_0$. The Kolmogorov complexity
$C(\cdot)$ of a binary string $x$ is defined by $C(x) = C_{\phi_0}(x)$.

S. Arikawa, K. Furukawa (Eds.): DS’99, LNAI 1721, pp. 3 1 fi- II I 1999. 
Definition 1 (Generalized Descriptional Complexity). We define the generalized
descriptional complexity $B$ of a binary string $x$ as the descriptional
complexity relative to a class of partial recursive functions $D$ and a certain
encoding of the functions (whose length is given by $l$), as follows:

\[ B^l_D(x) = \min\{\, l(T) + C_T(x) : T \in D \,\} \]



Proposition 1. Let $A$ be the class of generalized sequential machines (GSM).
For binary strings $x$, $B^l_A(x) = \Omega(\sqrt{n})$, where $n$ is the length
of $x$ and $l$ gives the length of the GSM in a standard state-by-state encoding.

Proof. Let $M$ be a GSM. Let $q$ be the maximum length of output for one
transition in $M$. Given input $p$, the maximum length of an output of $M$ is
$|p| q$. Since $q$ must appear somewhere in the description of $M$,
$l(M) \ge q$. If $p$ computes $x$ of length $n$, $|p| = q = O(\sqrt{n})$
minimizes $B^l_A(x)$. Therefore, $B^l_A(x) = \Omega(\sqrt{n})$.



Definition 2 (Discovery). Let $D$ be a set of partial recursive functions. Let
$\succ$ be a certain partial order on integers. Given a binary string $x$ and a
function $l$ which gives the length of a partial recursive function in a certain
encoding, a pair $(d, p)$ of a partial recursive function $d$ and a binary
string $p$ is a discovery for $x$ if the following properties hold:
(1) $d \notin D$, (2) $d(p) = x$, (3) $B^l_D(x) \succ B^l_{D \cup \{d\}}(x)$.

We consider that $x$ represents the data gained from experiments and $D$
represents the laws that have been discovered up until now. The partial
recursive function $d$ may be regarded as the law discovered, and $p$ describes
how the data is acquired based on $d$. The partial order depends on the degree
of discovery. Choices for $\succ$ may be:
(a) $B^l_D(x) > B^l_{D \cup \{d\}}(x) + c$,
(b) $B^l_D(x) > c \cdot B^l_{D \cup \{d\}}(x)$, or perhaps
(c) $\log(B^l_D(x)) > B^l_{D \cup \{d\}}(x) + c$,
where $c$ is a constant also expressing the degree of discovery. We may call a
discovery satisfying $\succ$ for a particular $c$ a $c$-discovery.
We justify the definition as follows. The first property, $d \notin D$, requires
a discovery to be new. The second property, $d(p) = x$, requires that we have
actually discovered how to explain $x$. The last property requires that a
discovery compresses observational data, as mentioned in the introduction. When
a discovery is made, $d$ is added to $D$, strengthening the descriptional power
of our knowledge. The same or similar laws are not considered discoveries
thereafter.
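Definitions 1 and 2 can be made concrete with a small brute-force sketch in Python. Everything specific here is an assumption for illustration: the two toy decoders `identity` and `double`, their encoding lengths 1 and 2, the bounded program search, and the use of order (a). Real descriptional complexity is defined over partial recursive functions and is not computable in general.

```python
from itertools import product


def complexity(decoder, x, max_len):
    """C_T(x): length of the shortest p with decoder(p) == x, up to max_len."""
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            if decoder("".join(bits)) == x:
                return n
    return float("inf")


def gen_complexity(D, x):
    """B_D(x) = min over T in D of l(T) + C_T(x); D maps decoder -> l(T)."""
    return min(l + complexity(T, x, len(x)) for T, l in D.items())


def is_c_discovery(D, d, l_d, p, x, c):
    """Check properties (1)-(3) of Definition 2 with order (a)."""
    if d in D or d(p) != x:
        return False
    return gen_complexity(D, x) > gen_complexity({**D, d: l_d}, x) + c


def identity(p):
    """Known law: the data is its own description."""
    return p


def double(p):
    """Candidate law: the data is some string written twice."""
    return p + p


D = {identity: 1}  # assumed encoding length l(identity) = 1
x = "01100110"
found = is_c_discovery(D, double, 2, "0110", x, c=2)
```

Here the known class describes x in 1 + 8 = 9 bits, while adding `double` brings this down to 2 + 4 = 6 bits, so the pair (`double`, "0110") is a 2-discovery but not a 3-discovery.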

Proposition 2. Let $U$ be a universal partial recursive function. If $U \in D$
and $l$ is a function that gives the length of a self-delimiting binary string
representing the index in an enumeration of Turing machines, there exists a
constant $c$ such that for all $c' > c$ there can be no more $c'$-discoveries
for any $x$.

Proof. From the proof of Theorem 1, $\forall D\,[C(x) \le B^l_D(x)]$. Since $U$
is universal, there is a $c_U$ such that $\forall x\,[C_U(x) \le C(x) + c_U]$.
Since $\forall x\,[B^l_D(x) \le l(U) + C_U(x)]$,
$\forall x\,[B^l_D(x) \le C(x) + l(U) + c_U]$. If $c > l(U) + c_U$,
$B^l_D(x) - c > B^l_{D \cup \{d\}}(x)$ is impossible for order (a). The proof is
similar for orders (b) and (c).
Abstract. Feature selection methods search for an “optimal” subset of
features. Many methods exist. We evaluate the consistency measure along
with different search techniques applied in the literature and suggest
guidelines for its use.



Feature Selection and Consistency Measure: Feature selection (FS) is an
important preprocessing technique. A typical FS method consists of an
evaluation measure that measures the goodness of a candidate subset, and a
search technique that searches through the candidate subsets to find the
“optimal” one. Evaluation measures can be divided into five categories:
distance (e.g. Euclidean distance), information (e.g. information gain),
dependency (e.g. correlation coefficient), classifier error rate and
consistency (e.g. inconsistency rate).
The suggested measure is an inconsistency rate over the data set for a given
feature set. The inconsistency rate is calculated as follows: (1) two patterns
are considered inconsistent if they match on all features but differ in their
class labels; (2) the inconsistency count for a pattern is the number of times
it appears in the data minus the number of its occurrences with the most
frequent class label; and (3) the inconsistency rate is the sum of the
inconsistency counts for all possible patterns of a feature subset divided by
the total number of patterns. The consistency measure is monotonic, fast, able
to remove redundant and/or irrelevant features, and capable of handling some
noise. There are many types of noise. The type that can be handled by the
consistency measure is the one in which class labels are mistakenly flipped.
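The three-step calculation of the inconsistency rate can be sketched in Python as follows. The function name, the argument layout (rows as tuples of feature values, `subset` as a list of column indices) and the toy data are assumptions of this example.

```python
from collections import Counter, defaultdict


def inconsistency_rate(rows, labels, subset):
    """Inconsistency rate of the data projected onto the given feature subset."""
    # Group instances by their pattern on the selected features and count
    # how often each class label occurs for every pattern.
    groups = defaultdict(Counter)
    for row, label in zip(rows, labels):
        groups[tuple(row[k] for k in subset)][label] += 1
    # Inconsistency count of a pattern: occurrences minus the majority label.
    count = sum(sum(c.values()) - max(c.values()) for c in groups.values())
    return count / len(rows)


rows = [(0, 1), (0, 1), (0, 0), (1, 0)]
labels = ["a", "b", "a", "b"]
full = inconsistency_rate(rows, labels, [0, 1])
only_second = inconsistency_rate(rows, labels, [1])
```

In this toy data the pattern (0, 1) occurs with both labels, so even the full feature set has a rate of 0.25. Dropping the first feature raises the rate to 0.5, in line with the monotonicity of the measure.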
Different Search Techniques: Five different algorithms represent standard
search strategies: exhaustive - Focus, complete - ABB, heuristic - SetCover,
probabilistic - LVF, and hybrid of ABB and LVF - QBB. We examine their
advantages and disadvantages to find out which is the best under
different circumstances. In the following, M is the optimal number of features
and N is the total number of features.

Focus - Exhaustive Search: Focus starts with an empty set and carries out
breadth-first search until it finds a minimal subset that predicts pure classes.
As Focus is an exhaustive search, it guarantees an optimal solution. However,
its time performance can deteriorate if M is not small with respect to N.

ABB - Complete Search: ABB is an automated Branch & Bound algorithm having
its bound as the inconsistency rate of the data set when the full set of
features is used. Since inconsistency is a monotonic measure, ABB guarantees an
optimal solution. However, ABB’s time performance can deteriorate as the
difference N - M increases.

SetCover - Heuristic Search: SetCover uses the observation
that the problem of finding the smallest set of consistent features is
equivalent to ‘covering’ each pair of examples that have different class labels
with some feature on which they have different values. The advantages of
SetCover are that it is fast, close to optimal, deterministic and works well for
independent features. It may, however, have problems where features have
inter-dependencies.

LVF - Probabilistic Search: LVF randomly generates subsets and keeps the
smallest subset that satisfies a threshold inconsistency rate. It is fast in
reducing the number of features. But after some time it still generates subsets
randomly, and computing resources are spent on generating many subsets that are
obviously not good.

QBB - Hybrid Search: QBB runs LVF in the first phase and ABB in
the second phase so that the search is more focused after the sharp decrease in
the number of valid subsets. A key issue remains: what is the crossing point in
QBB at which ABB takes over from LVF? If we allow only a certain amount of
time to run QBB, an experimentally found crossing point is to assign equal time
to LVF and ABB.
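The probabilistic search of LVF can be sketched in Python as below. This is a minimal reading of the description above, not the published algorithm in full: the way a random subset is drawn, the `max_tries` budget, the fixed seed and the toy data are assumptions of the example.

```python
import random
from collections import Counter, defaultdict


def inconsistency_rate(rows, labels, subset):
    """Inconsistency rate of the data projected onto the given feature subset."""
    groups = defaultdict(Counter)
    for row, label in zip(rows, labels):
        groups[tuple(row[k] for k in subset)][label] += 1
    count = sum(sum(c.values()) - max(c.values()) for c in groups.values())
    return count / len(rows)


def lvf(rows, labels, max_tries, threshold, seed=0):
    """Keep the smallest random subset whose inconsistency rate is acceptable."""
    rng = random.Random(seed)
    n = len(rows[0])
    best = tuple(range(n))  # start from the full feature set
    for _ in range(max_tries):
        size = rng.randint(1, n)
        candidate = tuple(sorted(rng.sample(range(n), size)))
        # Only strictly smaller subsets can replace the current best one.
        if (len(candidate) < len(best)
                and inconsistency_rate(rows, labels, candidate) <= threshold):
            best = candidate
    return best


# Feature 0 alone determines the label; features 1 and 2 are uninformative.
rows = [(0, 0, 1), (0, 1, 1), (1, 0, 1), (1, 1, 1)]
labels = [0, 0, 1, 1]
selected = lvf(rows, labels, max_tries=200, threshold=0.0)
```

The sketch also shows the weakness noted above: once a small subset has been found, most of the remaining random draws are larger subsets that are rejected immediately.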

Summary - when to use what: As we have five search techniques to choose
from, we are also interested in how we should use them. Theoretical analysis
and experimental experience suggest the following. If M is small, Focus should
be chosen; however, if M is even moderately large, Focus will take a long time.
If there are a small number of irrelevant and redundant features, ABB should
be chosen; but ABB will take a long time for a moderate number of irrelevant
features. For data sets with large numbers of features, Focus and ABB should
not be expected to terminate in realistic time. Hence, in such cases one should
resort to heuristic or probabilistic search for faster results. Although these
algorithms may not guarantee optimal subsets, they will be efficient in
generating near-optimal subsets in much less time. SetCover is heuristic, fast,
and deterministic. It may face problems with data having highly interdependent
features. LVF is probabilistic, not prone to the problem faced by SetCover, but
slow to converge in later stages. QBB is a welcome modification as it captures
the best of LVF and ABB. It is reasonably fast (slower than SetCover), robust,
and can handle features with high interdependency.
1 Introduction

In this paper, we attempt to model the process of noun vocabulary acquisition in
young children. Noun acquisition plays a central role in the early stages of
human language acquisition. In inferring the meanings of words, children face
the well-known ‘Quine’s induction problem’. Quine pointed out that it is
logically impossible to pin down the reference of a word uttered in a situation
only by situational cues. For example, when a child hears her mother utter the
word ‘rabbit’ while pointing to a white fluffy animal eating a carrot, how can
she conclude that ‘rabbit’ does not refer to the animal’s long ears, its hair,
or the animal eating a carrot?

Developmental psychologists have proposed that children’s word learning is 
constrained by a set of internal biases about how words are mapped onto their 
meanings. We propose a model of children’s noun vocabulary acquisition in the 
framework of inductive logic programming which embeds these constraints. 

2 Cognitive Psychological Constraints 

Besides the well-known ontological constraints, we incorporated the biases that
children use in mapping novel nouns onto their meanings. The following biases
have been proposed.

By the whole object bias, children assume that the referent of a given label
is the entirety of an object. For example, on hearing the word ‘rabbit’,
children assume that it refers to the whole rabbit, instead of its ear, the
color of the fur, or any other salient property of the animal. Furthermore,
constrained by the object category bias, children assume that the novel noun
refers to an object category rather than to a particular individual. In other
words, children by default assume that a novel noun is a common noun rather than
a proper noun. Children also assume that nouns refer to mutually exclusive
‘basic level’ categories (the mutual exclusivity bias) and rely on shape in
determining the membership of the category referred to by the novel noun (the
shape bias).

3 Sensory Inputs 

In this work, we introduce two separate predicates: (1) attr(ID, attribute,
value), which states that the given target ID has ‘value’ as its ‘attribute’,
and (2) ont(ID, ont-prop, value), which states that ID has ‘value’ as its
ontology property ‘ont-prop’.

4 An ILP Model 

Inductive logic programming is a framework for learning relational concepts
when a set of positive and negative examples together with background knowledge
are given. It provides a framework that is particularly suitable for acquiring
concept descriptions.

Our ILP model is most notably different from standard ILP systems in that the
system learns perpetually, just as the human learning process takes place
continuously throughout life. Since only one positive example is given in
each session, we need to adopt a new evaluation criterion for selecting an
appropriate hypothesis. Furthermore, the confidence level of each accepted
hypothesis needs to be updated, since it is incremented through
successive learning sessions.

We adopted a Progol architecture for this purpose. That is, we first com- 
pute the most specific hypothesis (MSH) for a given example (a label) and the 
background knowledge (the sensory inputs and the already-known concept de- 
scriptions). 

The whole object bias is incorporated in the system to specify the sensory 
input. The object category bias is used to select inductive generalization among 
all other cognitive activities. The mutual exclusivity bias and the shape bias 
serve as a device for resolving a conflict when more than one label is given to 
the same target object. If a newly labeled familiar object is similar in shape 
to prototypical members of the old, familiar category, the new label refers to 
a category subordinate to the old category. In contrast, when the shape of the 
labeled object is different from prototypical members of the old category, the 
new label is regarded as referring to a category which is mutually exclusive to 
the old category. 

5 Current Status and Future Work 

We started a preliminary implementation. The system can learn some words and 
can resolve conflicts. We need more experimental studies and improvements of 
the model to show the feasibility of our approach. 
1 Introduction

An image processing procedure that can extract a common feature from each of
the images belonging to the same class can be considered to represent knowledge
about the feature, so the automatic acquisition of such a procedure is a kind of
knowledge discovery process. Many expert systems have been proposed, including
our system called IMPRESS, for the automatic generation of image processing
procedures. Those systems can generate procedures that extract the shape of
figures as precisely as possible. However, in application fields such as
industry and medicine, it is often necessary to produce procedures that meet a
requirement on the misclassification rate per image. In this paper, we propose
a new expert system called IMPRESS-Pro (IMPRESS based on Probabilistic model) to
automatically construct a procedure from sample sets of classified images based
on a misclassification rate requirement given by a user.



2 IMPRESS-Pro 

Fig. 1 shows the outline of IMPRESS-Pro. The input data is a set of images
consisting of samples of normal and abnormal images and sample figures. Each
sample figure is a binary image which shows the abnormal part in the
corresponding abnormal image sample. Requirements for misclassification rates,
expressed by the probabilities of false positive (FP) and false negative (FN),
are also input. Here, FP is the event of misclassifying a normal image as
abnormal and FN is the converse event (missing an abnormal image).

As shown in Fig. 1, IMPRESS-Pro has a sequence of local processes whose
algorithms and parameters are unfixed at first. Then, the system decides them
sequentially at steps (b), (c) and (e), after estimating the condition for each
process in step (a) from the requirements on the final misclassification rate
based on the probabilistic model. At each parameter decision step, the system
applies all of the possible algorithms in a knowledge database to the input
images and selects the optimum one. Finally, the system outputs a sequence of
local processes with concrete algorithms and parameter values.

An example of image processing procedures acquired from thirty-two abnormal
and normal real images (images of industrial parts) with requirements
P{F.P.}<0.07 and P{F.N.}<0.07 is shown at the bottom of Fig. 1. We applied it
again to the input images and confirmed that both of the misclassification rates
satisfied the requirements.



[Figure: inputs are sample figures, abnormal images and normal images from the
image database, together with the requirement for misclassification rate
(bounds on P{F.P.} and P{F.N.}). Steps: (a) estimation of requests for local
process performances, (b) decision of smoothing and differentiation process,
(c) decision of binarization process, (d) feature measurement of connected
components, (e) decision of classification process in a feature space. Output:
image processing procedure. Example of acquired procedures with application
results: differentiation (8-Laplacian, difference distance = 3), binarization
(Type I, threshold = 38), classification in feature space (T = 0.2459).]

Fig. 1. Outline of procedure construction process and experimental result. 



For details of IMPRESS-Pro, see the referenced paper.

3 Conclusion 

We proposed a new image processing expert system for knowledge discovery from
an image database. This system can construct an image recognition procedure
that satisfies the given condition on misclassification rates. The system was
applied to a defective part extraction problem on LSI packages with promising
results.

With the development of computer science, a great deal of knowledge has been
obtained from the analysis of huge amounts of numerical data. It is likely that
new science can be discovered using this knowledge. In this paper, we propose a
derivation of new statistical laws from time series data. Our approach combines
the theory of time series data analysis with the effective theory for complex
systems.

In time series data analysis, Auto Regressive (AR) type models are highly
developed [1]. Many AR type models, such as the Auto Regressive Moving Average
model and the Vector Auto Regressive model, have been proposed during the last
twenty years. AR type models have been applied successfully in diverse fields
such as power plant control, earthquake analysis, and stock market prices.

In this paper, we consider the AR type models as effective theories for these 
complex systems. The effective theory is closed among macro or slowly changing 
variables. We obtain it by integrating out the micro or quickly changing vari- 
ables. The theory of the Langevin equation is a typical effective theory, in which 
the effects from quickly changing variables are treated as white noise. Coupled 
discretized Langevin equations govern the AR type models. We assume that the 
white noise in the AR type model is due to dynamical effects of the underlying 
quickly changing variables and measurement error is relatively small. 
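The link between a discretized Langevin equation and an AR type model can be sketched as follows. This is only an illustration under stated assumptions: a single variable, a constant coefficient a, an Euler step dt, and a noise strength sigma; all function names and parameters are hypothetical and are not taken from the paper.

```python
import math
import random

def ar1_from_langevin(a, sigma, dt):
    """Euler-discretize dx/dt = -a*x + noise into x[n+1] = phi*x[n] + s*eps[n].

    sigma (noise strength) and dt (time step) are assumptions of this sketch.
    """
    phi = 1.0 - a * dt            # AR(1) coefficient
    s = sigma * math.sqrt(dt)     # standard deviation of the discrete white noise
    return phi, s

def simulate_ar1(phi, s, n, seed=0):
    """Generate n samples of the AR(1) process, starting from x = 0."""
    rng = random.Random(seed)
    x, xs = 0.0, []
    for _ in range(n):
        x = phi * x + s * rng.gauss(0.0, 1.0)
        xs.append(x)
    return xs

def estimate_phi(xs):
    """Least-squares estimate of the AR(1) coefficient from a series."""
    num = sum(xs[i] * xs[i + 1] for i in range(len(xs) - 1))
    den = sum(x * x for x in xs[:-1])
    return num / den
```

Fitting the AR coefficient to the simulated series recovers approximately 1 - a*dt, which is the sense in which the AR model acts as an effective theory for the slow variable in this toy setting.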

Recently, Sekimoto and Sasa reconstructed thermodynamics in a model of
molecular machinery governed by the Langevin equation [2]. Let us consider a
simple example: a harmonic oscillator coupled with a heat bath whose
temperature is T. The spring coefficient a(t) of the oscillator can be changed
externally in time. The system is governed by the following Langevin equation,

$\dot{x} = -a(t)\,x + T\,\xi(t)$

where x is the position of the oscillator and $\xi(t)$ is the thermal noise
from the heat bath. We neglected the inertia term.

In Sekimoto-Sasa theory, we interpret this as follows: (1) the Langevin
equation is a balance of forces; (2) heat is the work done by the reaction
force on the heat bath. Then we can derive the conservation law of energy, or
the first law,

$\langle Q \rangle + \left[ a(t)\,\langle x^2 \rangle / 2 \right]_0^{\tau} = \int_0^{\tau} dt\, \dot{a}(t)\, \langle x^2 \rangle / 2$

where the right-hand side is the work done by the change of the parameter a(t).

In the case of a slowly varying parameter, we can expand the heat
$\langle Q \rangle$ as

$\langle Q \rangle = \langle Q_r \rangle + \langle Q_{ir} \rangle + \cdots$.

The reversible heat production, $\langle Q_r \rangle$, is proportional to
$a(\tau) - a(0)$, and the irreversible one, $\langle Q_{ir} \rangle$, is
proportional to $da(t)/dt$. The positivity of the irreversible heat production
corresponds to the second law of thermodynamics.

In this paper, we propose “thermodynamics” constructed by applying
Sekimoto-Sasa theory to the Auto Regressive type models for nonphysical
systems, such as economic and biological ones. In such systems, the random
fluctuation is no longer thermal. The “temperature” of a noise source is
associated with the strength of the random fluctuation. The time derivative is
caused by a “force”. Then “thermodynamical” laws, namely “energy” conservation
and positive “entropy” production, appear with completely new interpretations
in terms of economy and biology. For plural noise sources, a “Fourier law”, in
which flow is proportional to the “temperature” difference, also appears.

Detailed balance is generally broken in nonphysical systems, and the
equilibrium distribution cannot exist in the original frame. An irreversible
circulation of fluctuations appears, and the equilibrium distribution and
“potential energy” can be defined in the rotating frame. We applied our
arguments to the zero-power point reactor kinetics model and obtained the
irreversible circulation of fluctuations. We are trying to derive new
“thermodynamical” laws from data from an actual test nuclear reactor.

1 Introduction

The WWW is a huge database, and discovering new knowledge from such a database
is an important theme. Search engines are commonly used, and clustering of
search results has been proposed. However, such information is used only
temporarily and is kept only in the user’s bookmarks. Bookmarks do not
represent the whole structure of the user’s knowledge. Forming a new knowledge
structure is much harder and more valuable.

This paper describes the system KN (Knowledge Network), which visualizes the
linkage structure of a user’s personal knowledge of URLs. It is an extension of
the system in [3]. Personal knowledge of URL links is drawn as a directed graph
called a Web graph. Some software displays the static structure of a particular
web site, and linkage information on the WWW has also been studied. Our
database keeps not only general linkage information but also personal
knowledge. The system supports a personal and progressive linkage database.

Our system stores background knowledge as Web graphs. Given a piece of
background knowledge and some keywords, the system obtains new URLs from a
search engine with the keywords and places them on the given background
knowledge. A new URL is placed around an old URL when they are directly linked.
In this way, a user can develop his or her knowledge and keep the new knowledge
as another piece of background knowledge.

2 Knowledge on WWW 

The system consists of background knowledge and a URL database (Fig. 1).

The database, given a URL, returns a list of URLs to which the given URL has
links. A user develops background knowledge with search engines, a list of
URLs, and Web browsing. A new node is placed near a node to which the new one
links. The new Web graph can be stored as new background knowledge.
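The placement rule described above can be sketched as follows. This is a minimal illustration, not the KN implementation: the dictionary-based URL database, the coordinate layout, the `offset` parameter, and the choice to skip new URLs that link to no known node are all assumptions made for the example.

```python
def place_new_urls(positions, url_db, new_urls, offset=1.0):
    """Return positions extended with new URLs placed near old URLs they link to.

    positions: dict url -> (x, y) for nodes of the background knowledge.
    url_db:    dict url -> list of URLs that the given URL links to.
    new_urls:  URLs returned by a keyword search (hypothetical input).
    """
    placed = dict(positions)
    for n, url in enumerate(new_urls):
        # Old nodes that this new URL links to, according to the URL database.
        linked_old = [t for t in url_db.get(url, []) if t in positions]
        if not linked_old:
            continue  # assumption: unlinked URLs are left off the graph
        x, y = positions[linked_old[0]]
        placed[url] = (x + offset, y + offset * (n + 1))
    return placed
```

For example, with a background node "841" at the origin and a database entry stating that "109" links to "841", the new node "109" is placed next to "841", while a URL with no link into the background knowledge is not placed.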

Fig. 2 describes the linkage structure of the URL at the site of FreeBSD
(http://www.jp.freebsd.org/links) and its development with the keywords
“gnome” and “Japanese”. Gray nodes are URLs in the background knowledge,
that is, the URLs in the page at the above URL. New nodes obtained by a search
engine are colored white. The white node “109” is placed near the gray node
“841”, to which “109” links.

Fig. 1. Outline of the system KN.

Fig. 2. The Web graph developed by a keyword search.



3 Conclusion 

We presented the system KN, which visualizes Web graphs. The graph is personal
and progressive. Each user interactively develops his or her knowledge with the
system.

URLs obtained by a search engine have almost no connections to each other.
With adequate background knowledge, however, we obtain a Web graph in which
some URLs are connected. Moreover, when we get a URL from our system, we can
easily see the structure around the URL.

Increasing the background knowledge and the contents of the URL database is
important future work. It is also future work to build a system that enables a
user to add information to a URL. For example, the keywords used in a search
are an important and very personal property of the URLs.

1 Introduction

We extended Basket Analysis to transaction data having graph structures in our
previous work. However, that extension is applicable only to cases where all
nodes contained in a graph are mutually distinct. Though this is suitable for
the analysis of graph structures such as browsing patterns among URLs, it
cannot mine frequent patterns where multiple nodes of the same kind appear in a
graph (e.g., chemical molecule structures). We propose an extension of the
Apriori algorithm to mine this type of graph structure. We evaluate the
performance of our method in terms of the required computation time for various
factors.



2 Extension to Graph Structured Data 



Graph structured data can generally be transformed without much computational
effort into the form of an adjacency matrix, which is a well known
representation of a graph in mathematical graph theory. A node which
corresponds to the i-th row and the i-th column is called the i-th node, and
the number of nodes contained in a graph is called its size. We represent a
k × k adjacency matrix of a graph as $X_k$, the ij-element of $X_k$ as
$x_{ij}$, and its graph as $G(X_k)$. The objective task is to mine graph
structured data which do not have any direct loop link from a node to itself
and whose nodes do not have any attributes (labels). We consider one graph as
one transaction, and propose an algorithm to efficiently derive frequent
patterns contained in many transactions. We define “support” and “minimum
support” similarly to those of conventional Basket Analysis, and call a graph
having a support value greater than the minimum support a “frequent graph”,
similarly to a frequent itemset.

We extend the Apriori algorithm as follows. If $G(X_k)$ and $G(Y_k)$ in a set
of frequent graphs have equal matrix elements except for the elements of the
k-th row and the k-th column, then they are joined to generate $Z_{k+1}$:

$X_k = \begin{pmatrix} X_{k-1} & x_1 \\ x_2^T & 0 \end{pmatrix}, \quad Y_k = \begin{pmatrix} X_{k-1} & y_1 \\ y_2^T & 0 \end{pmatrix}, \quad Z_{k+1} = \begin{pmatrix} X_{k-1} & x_1 & y_1 \\ x_2^T & 0 & z_{k,k+1} \\ y_2^T & z_{k+1,k} & 0 \end{pmatrix}$

where $X_{k-1}$ is the adjacency matrix representing the graph of size k − 1,
and $x_i$ and $y_i$ (i = 1, 2) are vectors. Furthermore, when they are joined,
two adjacency matrices whose (k, k+1)-element and (k+1, k)-element are “0”
and/or “1” are generated. We call $X_k$ and $Y_k$ the first matrix and the
second matrix to generate $Z_{k+1}$, respectively. In conventional Basket
Analysis, items within an itemset are kept in lexicographic order. However, the
nodes contained in a graph do not have such an order, as nodes in the data do
not have any attributes. So we define the order of adjacency matrices as
follows. In the case of an undirected graph, the code $code(X_k)$ of an
adjacency matrix $X_k$ is represented in binary digits as in equation (1), by
arranging the upper triangular elements of $X_k$, excluding its diagonal.



In the case of directed graph data, the adjacency matrix of a directed graph
is represented as a base-4 code, analogous to the binary code of the
undirected graph. Furthermore, the two adjacency matrices are joined if and
only if the first matrix and the second matrix to generate $Z_{k+1}$ satisfy
equation (2). We call the adjacency matrix generated under this condition a
“normal form”, and non-normal matrices are not generated.
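The join step can be illustrated with a short sketch for the undirected, unlabeled case. The function name `join` and the list-of-lists matrix representation are assumptions of this example, and the ordering condition on the codes of the first and second matrices is not checked here; a caller would test it before joining.

```python
def join(first, second):
    """Join two k x k adjacency matrices into candidate (k+1) x (k+1) matrices.

    Returns an empty list when the leading (k-1) x (k-1) blocks differ.
    Undirected graphs are assumed, so z_{k,k+1} = z_{k+1,k}.
    """
    k = len(first)
    shared = range(k - 1)
    if any(first[i][j] != second[i][j] for i in shared for j in shared):
        return []
    candidates = []
    for z in (0, 1):  # the undetermined link between the two last nodes
        # Start from the first matrix, padded with one zero column and row.
        Z = [row[:] + [0] for row in first] + [[0] * (k + 1)]
        for i in shared:
            Z[i][k] = second[i][k - 1]   # last column of the second matrix
            Z[k][i] = second[k - 1][i]   # last row of the second matrix
        Z[k - 1][k] = Z[k][k - 1] = z
        candidates.append(Z)
    return candidates
```

Joining a single edge with itself, for instance, yields two candidates of size 3: a path through the shared node and a triangle.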



As a set of normal form matrices contains adjacency matrices which represent
the same graph structure, we define the matrix whose code is the least among
the matrices representing the same graph as the “canonical form”.
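The code of an undirected adjacency matrix and a canonical code can be sketched as below. As a simplifying assumption, this example searches all node renumberings by brute force rather than only the normal form matrices, which is practical only for very small graphs; the function names are hypothetical.

```python
from itertools import permutations

def code(X):
    """Binary code of an undirected adjacency matrix, read column by column
    over the upper triangle: x12, x13, x23, x14, ..."""
    k = len(X)
    return "".join(str(X[i][j]) for j in range(1, k) for i in range(j))

def canonical_code(X):
    """Smallest code over all renumberings of the nodes (brute force)."""
    k = len(X)
    return min(
        code([[X[p[i]][p[j]] for j in range(k)] for i in range(k)])
        for p in permutations(range(k))
    )
```

Because all codes of a given size have the same length, comparing them as strings agrees with comparing them as binary numbers. Two matrices describing the same graph under different node numberings then receive the same canonical code.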

3 Performance Evaluation 

The performance of our proposed method has been examined through artificially
generated graph structured transactions. First, the number of nodes in a graph
is specified. Next, a link between every pair of nodes is generated randomly
with a certain probability. Finally, four basic subgraph patterns of size 4 are
prepared, and one of them is randomly overwritten on each transaction. Table 1
shows the effect of the minimum support. The computation time of “generate
candidate” is the time required to derive all candidate graphs and to search
for the canonical form, and that of “count support” is the time to scan the
database and to count the support values. As shown in Table 1, when the
minimum support is decreased, the computation time and the number of frequent
graphs increase.



$code(X_k) = x_{1,2}\, x_{1,3}\, x_{2,3}\, x_{1,4} \cdots x_{k-2,k}\, x_{k-1,k}$    (1)

$code(\text{the first matrix}) < code(\text{the second matrix})$    (2)

Table 1. Computation time for various values of minimum support for directed graphs

Minimum       Num. of        Max. size of   Computation time [sec]
support [%]   freq. graphs   freq. graphs   generate candidate   count support   total
10            235            4              66.2                 18.8            85.0
30            213            4              56.4                 18.1            74.5
50            153            4              23.3                 14.3            37.6
70            21             4              0.5                  5.6             6.1

The number of transactions is 1000. The size of transactions is 10,
and the probability of link existence is 50%.

In order to resolve or ease retrieval difficulties on bibliographic databases, we
have been developing a bibliographic navigation system that implements our
proposed mining algorithms [1]. Our navigation system shows related keywords
derived from the query entered by a user, and guides users to retrieve
appropriate bibliographies. The thresholds used in the association mining
algorithm are usually given by the system administrator, so a method is
required that gives thresholds from which appropriate association rules for
the bibliographic navigation system can be derived. In this paper, we propose
a method which specifies the optimal thresholds based on ROC (Receiver
Operating Characteristic) analysis [2] and evaluate the performance of the
method on our practical navigation system.

According to [2], ROC graphs have long been used in signal detection theory to
depict tradeoffs between hit rate and false alarm rate. ROC graphs illustrate
the behavior of a classifier without regard to class distribution or error
cost, and so they decouple classification performance from these factors. The
ROC convex hull method compares multiple classifiers on an ROC graph and
specifies the optimal classifier, which supplies the highest performance. An
ROC graph characterizes a classifier by two parameters, the true positive rate
TP and the false positive rate FP. If FP is plotted on the X axis and TP is
plotted on the Y axis for several instances, a curve called the ROC curve is
drawn. A curve drawn nearer to the point where TP is higher and FP is lower,
that is, the most northwest curve, is better.

Although an ROC graph illustrates classification performance separately from
class distribution and error cost, the ROC convex hull method can take them
into account. Let c(classification, class) be a two-place error cost function,
where c(n, P) is the cost of a false negative error and c(y, N) is the cost of
a false positive error, and let p(P) be the prior probability of a positive
instance, so that the prior probability of a negative instance is
p(N) = 1 − p(P). Then the slope of an iso-performance line can be represented
by p(N)/p(P) · c(y, N)/c(n, P).

Table 1. Minsups at R_error = 145 and the average distances from the point (1, 0) on
the ROC graph. “AllPos” means deriving all and “AllNeg” means deriving nothing.

Category   p(N)/p(P) · 1/R_error   Optimal Minsup   ROC distance      ROC distance
                                                    (Minsup = 0.08)   (ROC algorithm)
1          0.0000 ~ 0.2211         AllPos ~ 0.02    0.8211            0.9477
2          0.2211 ~ 0.7139         0.02 ~ 0.04      0.8940            0.9725
3          0.7141 ~ 2.2706         0.04 ~ 0.25      0.9119            0.9008
4          2.2728 ~ 7.1847         0.25 ~ 0.40      0.9322            0.9857
5          7.2075 ~ 22.565         0.40 ~ 0.60      0.9926            0.9976
6          22.790 ~ 69.076         0.60             0.9929            0.9968
7          71.235 ~ 207.24         0.60             1.0262            1.0001
8          227.97 ~ 569.93         AllNeg           1.0159            1.0000
9          759.91 ~ 1139.9         AllNeg           1.0351            1.0000
10         2279.7                  AllNeg           1.1413            1.0000



We adapt the ROC analysis to the bibliographic navigation system by defining
$TP = |B \cap \bigcup_j R_j| / |B|$ and $FP = |\bar{B} \cap \bigcup_j R_j| / |\bar{B}|$,
where $\cup$ is the set union operator, $\cap$ is the set intersection
operator, $|\cdot|$ gives the number of items in a set, B is the bibliography
set covered by a query, $\bar{B}$ is the complement of B, and $R_j$ is the
bibliography set covered by $v_j$, the j-th keyword derived from the query.
Applying the ROC convex hull method [2] to the ROC graphs, we specify the best
classifier, that is, the Minsup, fitting p(N)/p(P) · 1/R_error. Here
R_error = c(n, P)/c(y, N) is the ratio of the false negative cost to the false
positive cost, so that the larger R_error is, the fewer retrieval omissions
occur. p(N)/p(P) is given by (|U| − |B|)/|B|, where |U| is the number of all
bibliographies in the retrieval and |B| is the hit count retrieved by the
query. Hence, Minsup can be decided from the hit count retrieved by a query.
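The quantities above can be computed with a few lines of code. This is a sketch under assumptions: bibliographies are represented as Python sets of identifiers, the false positive rate is taken over the bibliographies outside B, and the best operating point is chosen as the one whose iso-performance line of the given slope has the largest TP-intercept, which is the usual reading of the ROC convex hull method. All function names are hypothetical.

```python
def roc_point(B, U, derived_sets):
    """Return (FP, TP) for the keywords whose covered sets are derived_sets.

    B: bibliographies covered by the query; U: all bibliographies.
    """
    covered = set().union(*derived_sets)
    negatives = U - B
    tp = len(B & covered) / len(B)
    fp = len(negatives & covered) / len(negatives)
    return fp, tp

def iso_performance_slope(n_all, n_hits, r_error):
    """Slope p(N)/p(P) * 1/R_error with p(N)/p(P) = (|U| - |B|) / |B|."""
    return (n_all - n_hits) / n_hits / r_error

def best_classifier(points, slope):
    """Pick the name of the ROC point (fp, tp) maximizing tp - slope * fp."""
    return max(sorted(points), key=lambda k: points[k][1] - slope * points[k][0])
```

With |U| = 330,562 and R_error = 145, a query with a single hit gives a slope of about 2279.7, which matches the last category of Table 1; such a steep slope selects the point (0, 0), that is, deriving nothing.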

In order to evaluate the performance, we use 330,562 titles published by
INSPEC in 1997 from our practical navigation system. Thus, the number of all
bibliographies in the retrieval, |U|, is 330,562. We categorize keywords into
ten classes by frequency in order to draw ROC curves with Minsup as the
classifier. As the results in Table 1 show, the optimal Minsup was decided for
each query to derive keywords as association rules. “ROC distance” in Table 1
is the distance between (FP, TP) and the point (1, 0), and a longer distance
indicates higher performance on an ROC graph. Therefore, our ROC algorithm can
derive better rules than the association mining algorithm using a fixed
Minsup. Using the ROC algorithm, we obtain derived keywords whose performance
is guaranteed by the ROC analysis.

When we find the best linear approximation for the task $y = X\beta + e$, we
usually use the least squares method (hereinafter LSM). The vector of factors
$\beta$ is calculated in this case by formula (1). It is possible to consider
the same task in spaces with the $l_\gamma$ ($\gamma = 4, 8, \infty$) metrics.
In this case we should solve task (2).

$\beta = (X^T X)^{-1} X^T y$    (1)

$\|e\|_{l_\gamma\,(\gamma = 4, 8, \infty)} \to \min_\beta$    (2)

If we have to avoid a single large outlier of magnitude
$|y_j - \beta \cdot x_j|$, we have to use models constructed in spaces with the
$l_\gamma$ ($\gamma = 4, 8, \infty$) metrics. Let us consider a weighted model.
The magnitude $w_j = 1 + (y_j - \beta \cdot x_j)^2$ is used as the weight of
observation j in the training set. The vector of factors $\beta$ is calculated
by the formula $\beta = (X^T W X)^{-1} X^T W y$, where
$W = \mathrm{diag}(w_1, w_2, \ldots, w_n)$ is the matrix of weights. Reduction
of the maximum error is the main advantage of the method of weighting
observations. If the matrix $X^T X$ is not singular, we can introduce the
iterative process J1 as follows:

1. $W_0 = E_n$

2. For k = 0, 1, ... we find the vector of factors $\beta_k = (X^T W_k X)^{-1} X^T W_k y$

3. $W_{k+1} = W_k + \mathrm{diag}((\hat{y}_k - y)^{2m})$, $\forall m < \infty$, where $\hat{y}_k = X\beta_k$

4. If k < K, for a K known beforehand, we return to step 2; otherwise we finish.

Let us introduce the iterative process J2 under the same assumptions as follows:

1. $W_0 = E_n$

2. For k = 0, 1, ... we find the vector of factors $\beta_k = (X^T W_k X)^{-1} X^T W_k y$

3. $m = \arg\max_j |\hat{y}_j^k - y_j|$

4. $W_{k+1} = W_k + \mathrm{diag}(0, 0, \ldots, 0, 1, 0, \ldots, 0, 0)$, where the unit is in the m-th place.

5. If k < K, for a K known beforehand, we return to step 2; otherwise we finish.

Now we can formulate two theorems (without proof).

Theorem 1. The iterative process J1 at $k \to \infty$ is equivalent to the
solution of the task $\|e\| \to \min_\beta$ in the space with the $l_{2m+2}$
metric.

Theorem 2. The iterative process J2 at $k \to \infty$ is equivalent to the
solution of the task $\|e\| \to \min_\beta$ in the space with the $l_\infty$
metric.
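The iterative process J2 can be sketched in plain Python as follows. This is an illustration under assumptions: the design matrix is a list of rows, the normal equations are solved by a small Gauss-Jordan routine written for this example, ties in the maximum residual are broken by taking the first index, and the function returns the factors from the last pass. All names are hypothetical.

```python
def solve(A, b):
    """Solve the linear system A x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [mr - f * mc for mr, mc in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def weighted_lsm(X, y, w):
    """beta = (X^T W X)^{-1} X^T W y for W = diag(w)."""
    n, p = len(y), len(X[0])
    A = [[sum(w[i] * X[i][a] * X[i][b] for i in range(n)) for b in range(p)]
         for a in range(p)]
    c = [sum(w[i] * X[i][a] * y[i] for i in range(n)) for a in range(p)]
    return solve(A, c)

def process_j2(X, y, K):
    """Run the J2 loop for k = 0, ..., K and return the last vector of factors."""
    w = [1.0] * len(y)                                   # step 1: W_0 = E_n
    for k in range(K + 1):
        beta = weighted_lsm(X, y, w)                     # step 2
        resid = [abs(sum(b * x for b, x in zip(beta, row)) - yj)
                 for row, yj in zip(X, y)]
        m = max(range(len(y)), key=resid.__getitem__)    # step 3: worst observation
        w[m] += 1.0                                      # step 4: add a unit weight
    return beta
```

As a small check, fitting a constant to the observations (0, 0, 3) starts from the mean 1.0 at K = 0, and for larger K the fitted value approaches 1.5, the midrange, which is the constant that minimizes the maximum error.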

The set of values $\hat{y}_j = \beta \cdot x_j$ becomes known after LSM
construction. One more set of values $g_j = \beta^{(j)} \cdot x_j$ becomes
known after using the procedure of sliding monitoring (hereinafter PSM). The
average value of the factors was taken as the model factor. The
root-mean-square deviation of the factors was taken as the error in the
evaluation of the model factor. Then we consider the hypothesis $H_0$:
{$\beta_i = 0$ for i = 0, ..., P} against the alternative hypothesis $H_1$:
$\beta_i \neq 0$. One possible statistic for the criterion is
$F = [\beta_i / \delta\beta_i]^2$. Under $H_0$ it has the F distribution with
1 and N − P degrees of freedom. We suppose that the best regression model is
the one with the maximum binary correlation coefficient between the set of
observed values {$y_j$} and the set of values {$g_j$} calculated by the PSM.

The main goal of this work is the generalization of the theoretical results to
models constructed using neural networks. The statements of the theorems can
be transferred to the case of neural networks. For model quality control we
used criteria such as the square of the binary correlation coefficient, the
mean absolute error (Er), the mean square error (Er^2), the maximum error
($Er_{max} = \max_j |\hat{y}_j - y_j|$), and the amount of errors (Dif). We
also used the amount of alpha errors ($D_1$) and the amount of beta errors
($D_2$). The criteria were applied both to the set of pairs
{$y_j$, $\hat{y}_j$} and to the set of pairs {$y_j$, $g_j$}.

Experimental verification was carried out on a sample provided by the
laboratory of cytochemistry and diagnostic research of the RAMS Pediatric
Institute. The sample consisted of 42 observations described by 11 input
variables. The main purpose of this research was to predict the presence or
absence of exacerbation in the mother after childbirth. The dependent variable
was coded as −1 and +1. We considered 11 indexes of the cytochemical analysis
of the woman’s blood lymphocytes as independent variables. The factors of the
LSM and the LSM with regularization of the solution (LSM_R) were obtained by
formula (1). The factors of the model constructed in the space with the $l_4$
($l_8$) metric, denoted L4 (L8), and of the model constructed in the space
with the $l_4$ ($l_8$) metric with regularization of the solution, denoted
L4_R (L8_R), were obtained from formula (2) with γ = 4 (γ = 8). The factors of
the model constructed in the space with the $l_\infty$ metric (L∞) and of the
model constructed in the space with the $l_\infty$ metric with regularization
of the solution (L∞_R) were obtained using the iterative process J2. Table 1
assembles the information about the quality of all constructed models. Based
on this information we can carry out a comparative analysis of the considered
models. The number of remaining variables is given in the column “Var”.

Rows in the table are arranged in order of increasing (PSM) value. The worse
models appear at the top and the best at the bottom of the table. Using metric
spaces with a large exponent improves the model’s ability to predict
correctly. Regularization of the solution should always be used.

1 Motivations and Purposes

The paradigm of agent-oriented computing is based on a societal view of com- 
putation, where social concepts play important roles in computation, such as 
negotiation, compromise, cooperation, coordination, collaboration, conflict res- 
olution and management, and so on. 

In this paper, we propose argument-based intelligent agent systems in which
several agents communicate, argue with each other, and finally make a decision
through argumentation, drawing on knowledge bases distributed or dispersed
over the network. We have taken the argument-based approach to agent-oriented
computing because we recognize that (knowledge) information is essentially
incomplete, uncertain, subjective, contradictory, and distributed. In the
upcoming information society with information of such a nature, where there
seem to be no specific and definite rules and methods for making decisions,
the first thing we can do is argue with the persons concerned.

We demonstrate two argument-based agent systems, which have been built using
different approaches. Through the demonstration, we claim that argumentation
is a basic process not only for attaining a consensus but also for discovery
or creativity in terms of distributed intelligence on the network (see [2] for
a detailed discussion).

2 Outline of Demonstration 

2.1 Argument-Based Agent Systems: A Brief Description 

In modeling argumentation, we have paid much attention to the following issues: 
(i) Knowledge base may be distributed and non-sharing in general, (ii) knowledge 
base may be inconsistent, but an agent always exploits a consistent subset of 
her/his knowledge base, (iii) each agent has her/his own argument strategy as a 
proper way of reasoning, and (iv) not only two agents (one proponent and one 
opponent) but also several agents can attend to an argument. 

The object language for knowledge representation is based on extended logic
programming, with negations allowed to appear in the head and body of a
clause. An argument is a proof tree, and the head of an argument by one agent
can rebut the nodes of other arguments if the head conflicts with those nodes.
Arguments and counterarguments proceed in turn, following the given
argumentation protocol.

We will demonstrate the following two argument-based agent systems, which
have been built using different approaches. One is based on our paper [3] and
the other is based on [1].

System 1: This argument system has two kinds of negation, weak negation and
strong negation, which allow for a natural representation of incomplete
information. Furthermore, this argument system has a defeat relation among
arguments [3], by which a judgement is given automatically.

System 2: In this system, plausibility values are assigned to rules, and the
plausibility of a whole argument is calculated in a reasonable way. A special
agent called a judge agent then controls and mediates arguments among agents,
using various criteria for the superiority of arguments. As an argumentation
protocol, it makes effective use of the contract net protocol.

Both systems have been implemented in Java for the communication control among
agents, and in Prolog for the construction of arguments and counterarguments.

2.2 Argument Examples 

The demonstration will be done by letting agents argue with other agents on
remote computers. We use the following realistic argument examples: (i) the
first concerns a real dispute on the issue “Is a nuke necessary or not at a
local town of Japan?”, and (ii) the second concerns a dispute in designing
software systems such as our argument systems themselves. In neither case is
it easy to foresee which side is predominant directly from the knowledge bases
themselves. In (ii), we also show how we are able to reach a dialectical
agreement [2].

Acknowledgement 

This research has been supported by the Kayamori Foundation of Information 
Science Advancement. 



References 

1. Maeda, S., Guan, C. and Sawamura, H.: An Argument-based Agent System with
the Contract Net Protocol, 1999. (to be presented at the First Asia-Pacific
Conference on Intelligent Agent Technology (IAT’99))

2. Sawamura, H., Umeda, Y. and Maeda, S.: Dialectical Logic as a Logical Basis for
Argumentation and Discovery, with Applications to Argument-based Agent System,
1999. (unpublished manuscript)

3. Umeda, Y. and Sawamura, H.: Towards an Argument-based Agent System, 1999.
(to be presented at the Third International Conference on Knowledge-Based
Intelligent Information Engineering Systems (KES’99))



Graph-Based Induction for General Graph 
Structured Data 



Takashi Matsuda¹, Tadashi Horiuchi¹, Hiroshi Motoda¹, Takashi Washio¹,
Kohei Kumazawa², and Naohide Arai²

¹ I.S.I.R., Osaka University, 8-1 Mihogaoka, Ibaraki, Osaka 567-0047, JAPAN
² Recruit Co., Ltd., 8-4-17 Ginza Chuo-ku, Tokyo 104-8001, JAPAN



1 Introduction 

A machine learning technique called Graph-Based Induction (GBI) efficiently
extracts typical patterns from directed graph data by stepwise pair expansion
(pairwise chunking). We expand the capability of GBI so that it can handle not
only tree structured data but also graph data with multi-input/output nodes
and loop structures (including self-loops), which cannot be treated in the
conventional way. We show the effectiveness of our approach by applying it to
real-scale World Wide Web browsing history data.

2 Graph-Based Induction for General Graphs 

The original GBI was formulated to minimize the graph size by replacing each
found pattern with one node, so that it repeatedly contracted the graph. The
graph size definition reflected the sizes of the extracted patterns as well as
the size of the contracted graph. This prevented the algorithm from
contracting continually, which meant the graph never became a single node.
Because finding a subgraph is known to be NP-hard, the ordering of links is
constrained to be identical if two subgraphs are to match, and an
opportunistic beam search similar to a genetic algorithm was used to arrive at
suboptimal solutions. In this algorithm, the primitive operation at each step
of the search was to find a good set of linked pair nodes to chunk (pairwise
chunking) [Motoda 97].

In this paper, we expand the capability of GBI to handle general graph
structured data such as directed graphs including multi-input/output nodes and
loop structures (including self-loops). We propose an idea to perform pairwise
chunking without losing the information of link connections. In order to apply
GBI to general graph structured data, we adopt a method that represents the
graph structured data in table form, paying attention to the link information
between nodes. We introduce a “self-loop distinction flag” to identify a
self-loop when the parent node and child node are of the same kind
(e.g., a → a).

Moreover, each time we perform pairwise chunking, we keep the link information
between nodes in order to be able to restore the chunked pairs to the original
patterns. The basic algorithm of the proposed method, which extends GBI to
handle general graph structured data, is shown in Fig. 1. In the implemented
program, we use the simple “frequency” of pairs as the evaluation function for
the stepwise pair expansion. The method was verified to work as expected using
artificially generated data, and we experimentally evaluated the computation
time of the implemented program. The computation time for 30,000 repetitions
is shown in Fig. 2 for three kinds of graph structured data (Data4: loop type,
Data5: lattice type, Data6: tree type) with three kinds of node labels. From
this figure, it is found that the computation time increases almost linearly
with the number of chunkings. Table 1 shows the preliminary result for the
classification problem of promoter DNA sequence data (106 cases in total).
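One step of pairwise chunking with pair frequency as the evaluation function can be sketched as follows. This is a simplified illustration and not the table-based algorithm of Fig. 1: the graph is given as a label dictionary and an edge list, occurrences are chunked greedily without overlap, ties between equally frequent pairs are broken alphabetically, and self-loops are simply left untouched instead of being handled with a distinction flag. All names are hypothetical.

```python
from collections import Counter

def most_frequent_pair(labels, edges):
    """Return the (parent label, child label) pair occurring on the most links."""
    counts = Counter((labels[u], labels[v]) for u, v in edges if u != v)
    if not counts:
        return None
    return max(sorted(counts), key=counts.__getitem__)

def chunk_once(labels, edges):
    """Replace non-overlapping occurrences of the most frequent pair by one node.

    labels: dict node id -> label; edges: list of directed links (u, v).
    Returns (new_labels, new_edges, chunked_pair).
    """
    pair = most_frequent_pair(labels, edges)
    if pair is None:
        return labels, edges, None
    chunk_label = "(%s->%s)" % pair
    used, merged, chunked = set(), {}, set()
    for u, v in edges:
        if (u != v and (labels[u], labels[v]) == pair
                and u not in used and v not in used):
            used.update((u, v))
            merged[v] = u          # the child is absorbed into the parent node
            chunked.add((u, v))
    new_labels = {n: lab for n, lab in labels.items() if n not in merged}
    for u in merged.values():
        new_labels[u] = chunk_label
    # Links other than the chunked ones are kept and redirected to the new node.
    new_edges = [(merged.get(u, u), merged.get(v, v))
                 for u, v in edges if (u, v) not in chunked]
    return new_labels, new_edges, pair
```

Applying `chunk_once` repeatedly to its own output gives the stepwise pair expansion; the `merged` mapping inside the function is the kind of link information that would have to be stored to restore chunked pairs to the original patterns.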






Fig. 1. Proposed algorithm

Fig. 2. Computation time

Table 1. Experimental result

3 Application to WWW Browsing Histories

The performance of the proposed method has been examined through a real-scale
application. The data analyzed is the log file of the commercial WWW server of
Recruit Co., Ltd. in Japan. The URLs on the WWW form a huge graph, where URLs
represent nodes connected by many links. When a client visits the commercial
WWW site, he/she browses only a small part of the huge graph in one access
session, and the browsing history of the session becomes small graph
structured data. The total number of URLs involved in this commercial WWW site
is more than 100,000, and it is one of the largest sites in Japan. Its total
number of hits by nationwide internet users consistently ranks within the top
three every month in Japanese internet records, and the typical size of the
log file of the WWW server for a day is over 400MB.

As the log file consists of a sequence of access records, the records are
first sorted by IP address, and then we transform the subsequence of each
client into graph structured data (150,000 nodes in total). After this
preprocessing, we executed the proposed method using the frequency of pairs as
the evaluation function. With frequency thresholds of 0.1%, 0.05%, and 0.025%
of the total nodes, the numbers of derived chunk patterns are 33, 106, and 278,
respectively. We could extract some interesting browsing patterns of many
clients, such as a) clients follow some URLs in the same directory, b) clients
go deep into the directories one after another, and c) clients jump to URLs in
a different directory after following some URLs in the same directory.
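
As a rough illustration of the preprocessing and the frequency-based evaluation described above, the following Python sketch groups access records by client IP, turns each client's history into parent/child URL pairs, and keeps the pairs whose count reaches a threshold given as a fraction of the total number of nodes. The record layout `(ip, timestamp, url)` and the name `frequent_pairs` are assumptions made for this example; the sketch covers only the pair-counting step, not the full chunking loop of the proposed method.

```python
from collections import Counter, defaultdict


def frequent_pairs(records, total_nodes, threshold):
    """Count consecutive URL pairs per client and keep the frequent ones.

    records     -- iterable of (ip, timestamp, url) tuples (assumed layout)
    total_nodes -- total number of nodes in the graph structured data
    threshold   -- minimum frequency as a fraction of total_nodes
    """
    # Sorting the tuples orders records by IP address, then by timestamp.
    sessions = defaultdict(list)
    for ip, _timestamp, url in sorted(records):
        sessions[ip].append(url)

    # Each consecutive (parent, child) URL pair is one candidate for chunking.
    pair_counts = Counter()
    for urls in sessions.values():
        for parent, child in zip(urls, urls[1:]):
            pair_counts[(parent, child)] += 1

    min_count = threshold * total_nodes
    return {pair: n for pair, n in pair_counts.items() if n >= min_count}


records = [
    ("1.1.1.1", 2, "/a/y"), ("1.1.1.1", 1, "/a/x"),
    ("2.2.2.2", 1, "/a/x"), ("2.2.2.2", 2, "/a/y"), ("2.2.2.2", 3, "/b/z"),
]
result = frequent_pairs(records, total_nodes=5, threshold=0.4)
```

In this toy input only the pair `("/a/x", "/a/y")` is followed by both clients, so it is the only pair that reaches the threshold.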
1 Introduction

Classification is one of the most important problems in data mining.
The measure of success in a classification problem is the accuracy of the
classifier, usually defined as the percentage of correct classifications.
Recently, neural networks have been considered as one approach to this problem,
but the difficulty is that the classification rules generated by neural
networks are not explicitly represented in a human-understandable form.

Size and speed are two concerns when one wants to solve an actual problem
with neural networks. Usually, the number of hidden units in a multilayer
neural network is one of the most important considerations in determining the
size of the network. There are many methods to reduce the structure of the
networks, such as destructive, constructive, and genetic algorithms (Weymaere
and Martens, 1994).

Constructive or growth methods start from a small network and dynamically
grow the network (Giles, Chen, Sun, Chen, Lee and Goudreau, 1995).
Constructive learning algorithms have been used to handle multi-category
classification with convergence to zero classification errors (Parekh et al.,
1995). The advantage of this method is that it can automatically find the size
and the topology of the neural network without specifying them before
training. In our paper (Mohamed et al., 1998), we proved that the convergence
rate of the output error does not depend on the connection weight between the
input layer and a unit that is added to the hidden layer after the network has
been trained.

In this paper, a constructive learning algorithm for a three-layer feedforward
neural network is proposed to train the network under supervised conditions on
the output layer and penalty conditions on the connections between the layers.
The supervised conditions are used not only to increase the accuracy but also
to reduce the training time. The penalty conditions are used to reduce the size
of the network by removing unnecessary connections and preventing the weights
from taking large values.

2 Rules Extraction Algorithm 

We are given data tuples, each of which consists of a number of attributes,
and a target function that determines the class number for each tuple. The
values of these attributes can be generated from the available data. From our
point of view, much more attention should be paid to the following items in
finding classification rules from the network:

1. Training the network by minimizing the error function with the supervised
condition and penalty conditions.

2. Removing unimportant connections after learning the network. 

3. Clustering the hidden-unit activation values into a manageable number of 
discrete values without sacrificing accuracy of the network. 

4. Discovering the feature relations among the hidden-unit activation values
with both the output values and the input values.

5. Extracting the classification rules between input units and output units. 

In the experiment we focus on extracting rules for the problem defined by
(Agrawal et al., 1993). The training set contains 1000 tuples, and the testing
set contains the same number. Both training and testing data were generated
randomly based on the description values of the attributes.

The network after removing unimportant connections contains only a few
units in the input layer. We continued removing connections from the network
with the pruning algorithm as long as the accuracy remained high. The
classification accuracy was high even after pruning the network and the
hidden-unit clustering process.
Introduction

In a given dataset, we regard each pair of numeric attributes as a
two-dimensional attribute. A rule which classifies tuples according to whether
or not a tuple has a particular value of a two-dimensional attribute is called
a region rule. A region rule associated with two attributes x and y is depicted
on the x-y plane as shown in Fig. 1.

Fig. 1: Region Rule.

In the previous study, we proposed a weighted majority decision among several
region rules to classify numeric datasets, especially focusing on the
readability of the obtained knowledge |C]. We generalize the strategy so that
it can cope with categorical datasets.




Region Rules for Categorical Attributes 



First, let us consider one-dimensional rules on a single categorical attribute.
For a single categorical attribute, we can split the set of tuples according
to whether or not each tuple belongs to a particular subset of the categorical
values. Suppose that a categorical attribute takes values in the set
$\{c_1, c_2, \ldots, c_k\}$. There is an $O(k)$-time algorithm that generates a
subset which minimizes the entropy of splitting. During the generation of the
optimal subset, it checks only $k$ subsets instead of all the possible $2^k$
subsets [ 0 . For a tuple $t$, let $t[C]$ denote the value of an attribute $C$,
and let $W$ denote the objective attribute taking values 1 and 0
(corresponding to true and false, respectively). For each $c_i$, let $\mu_i$
denote the average of $W$'s values over all the tuples whose values of
attribute $C$ are $c_i$:

$$ \mu_i = \frac{\sum_{t[C]=c_i} t[W]}{|\{t \mid t[C]=c_i\}|} \qquad (1) $$

We rename and order the categorical values so that the averages $\mu_i$ meet
the condition $\mu_1 \le \mu_2 \le \cdots \le \mu_k$. The algorithm guarantees
that the optimal subset is of the form $\{c_i \mid 1 \le i \le j\}$.
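
The ordering property above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the input is taken to be a list of `(category, w)` pairs with `w` in {0, 1}, the names `entropy` and `best_categorical_split` are hypothetical, and only the proper prefixes of the ordered categories are examined, since the full set gives a trivial split.

```python
import math
from collections import defaultdict


def entropy(ones, total):
    """Binary entropy of a group with `ones` positive tuples out of `total`."""
    if total == 0 or ones == 0 or ones == total:
        return 0.0
    p = ones / total
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def best_categorical_split(tuples):
    """Return (subset, split_entropy) for (category, w) pairs with w in {0, 1}."""
    count = defaultdict(int)
    ones = defaultdict(int)
    for category, w in tuples:
        count[category] += 1
        ones[category] += w

    # Order the categories by the average of W, i.e. by mu_i in Eq. (1).
    ordered = sorted(count, key=lambda c: ones[c] / count[c])

    n = len(tuples)
    total_ones = sum(ones.values())
    best_subset, best_h = None, math.inf
    in_n = in_ones = 0
    # Only prefixes {c_1, ..., c_j} of the ordered values need to be checked.
    for j, category in enumerate(ordered[:-1]):
        in_n += count[category]
        in_ones += ones[category]
        out_n, out_ones = n - in_n, total_ones - in_ones
        h = (in_n / n) * entropy(in_ones, in_n) + (out_n / n) * entropy(out_ones, out_n)
        if h < best_h:
            best_subset, best_h = set(ordered[: j + 1]), h
    return best_subset, best_h


data = [("red", 0), ("red", 0), ("green", 0), ("blue", 1), ("blue", 1), ("green", 0)]
subset, h = best_categorical_split(data)
```

Here `red` and `green` have average 0 and `blue` has average 1, so the prefix containing `red` and `green` separates the two classes perfectly.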

We then generalize the idea of one-dimensional rules to create two-dimensional
region rules on categorical attributes based on the above property of
categorical attributes. The order among categorical values
($c_1 \le c_2 \le \cdots \le c_k$) defined by the averages enables us to
generate two-dimensional region rules using categorical attributes. Once the
categorical values are ordered, the generation of the region rules is the same
as that for numerical attributes.

Fig. 2. A region rule and a decision tree for categorical conditional attributes.

Fig. 2 shows a small example. It presents a region rule for the colic
dataset obtained from the UCI Machine Learning Repository |^. The values
of the two attributes (one of which is $A_{23}$) are ordered and renamed as
above (in both attributes, we assigned an additional attribute value for
missing data). In each grid cell, the attached values “N/M” show the numbers of
the tuples whose objective attribute is 1 and 0 in a test dataset,
respectively. We can observe that the gray region sorts out the tuples whose
objective attribute is 1. On the other hand, Fig. 2 also shows a decision tree
for the same training dataset generated by the See5 program 0. The nodes at
depth 1 and 2 correspond to the region rule. For the same test dataset used for
the above region rule, the error rate of this decision tree is the same as that
of the region rule.

Parallelization 

We parallelized the generation of region rules for the majority decision on two
shared-memory parallel computers: SGI Origin2000 (128 CPUs) and Sun
Microsystems Enterprise 10000 (64 CPUs). We will give an on-line demonstration
of our system.
Abstract. Many GP learning methods have been proposed to decrease
node combinations in order to keep the node combinations from increasing
explosively. We propose a technique using an opposite approach
which tests a greater number of combinations in order to decrease the
chances of the search being ‘trapped’ in a local optimum. In the proposed
technique, how ‘different’ the individual structure is serves as an index
in selecting individuals for genetic operations. Therefore, variety in the
GP group is strongly maintained, and GP learning is expected always to
explore new combinations.



1 Purpose 

In general, when a large number of node types are defined, the learning speed
of Genetic Programming (GP) pp may become very slow, or the GP may fail to
reach the global optimum solution. This is caused by the explosive increase in
combinations. To address this problem, we propose a technique using an
opposite approach which tests a greater number of combinations in order to
decrease the chances of the search being ‘trapped’ in a local optimum.
In the proposed technique, how ‘different’ the individual structure is serves
as an index in selecting individuals for genetic operations. Therefore,
variety in the GP group is strongly maintained, and GP learning is expected
always to explore new combinations.

2 Method 

In normal GP, genetic operators (crossover and mutation) are applied based on
the individual fitness. In the proposed technique, genetic operators are
applied based on an evaluation of the tree difference. Two types of comparison
can be made to measure the difference between trees. One is the difference in
tree structure between two individuals. The other is the difference in the
types and numbers of nodes that compose each individual. In this paper, we use
the difference in the types and numbers of composing nodes as the tree
difference. The evaluation value of the tree difference is defined by the
following expression.

$$ f_{diff}(A, B) = \sum_{i=1}^{K_{types}} |n_i(A) - n_i(B)| $$

$f_{diff}(A, B)$: evaluation value of the difference between tree structures A and B
$K_{types}$: the total number of defined node types
$n_i(\bullet)$: the number of nodes of the i-th node type used in $\bullet$

The proposed evaluation value is built into the genetic operator in GP. In this
paper, we used crossover as the modified genetic operator. In the normal
crossover operation, two target individuals are chosen based on the individual
fitness. In the crossover operation modified by the proposed technique, one
individual is chosen based on the individual fitness, and the other individual
is chosen based on the evaluation value of the difference of its tree
structure from that of the first one.
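
The difference measure and the modified parent selection can be sketched as below. This is only an illustration under assumptions that the text does not state: each individual is represented as a flat list of node-type labels, both selections use roulette-wheel sampling, a larger difference gives a higher chance of being chosen as the second parent, and the names `f_diff` and `select_parents` are hypothetical.

```python
import random
from collections import Counter


def f_diff(nodes_a, nodes_b):
    """Sum over node types of |n_i(A) - n_i(B)|.

    Node types used by neither tree contribute zero, so summing over the
    types that occur in A or B equals summing over all defined types.
    """
    count_a, count_b = Counter(nodes_a), Counter(nodes_b)
    return sum(abs(count_a[t] - count_b[t]) for t in set(count_a) | set(count_b))


def select_parents(population, fitness, rng=random):
    """Pick the first parent by fitness and the second by tree difference."""
    # Assumption: fitness values are positive, so they can serve as weights.
    weights = [fitness(ind) for ind in population]
    first = rng.choices(population, weights=weights, k=1)[0]

    others = [ind for ind in population if ind is not first]
    diffs = [f_diff(first, ind) for ind in others]
    if sum(diffs) == 0:
        # All remaining individuals have the same composition as the first.
        second = rng.choice(others)
    else:
        second = rng.choices(others, weights=diffs, k=1)[0]
    return first, second


tree_a = ["+", "x", "x", "1"]
tree_b = ["+", "*", "x"]
difference = f_diff(tree_a, tree_b)  # |1-1| + |2-1| + |1-0| + |0-1| = 3
```

Weighting the second choice by `f_diff` is one simple way to favor dissimilar partners; other selection schemes would fit the description equally well.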

3 New Aspect of Work 

In previous GP techniques, when a large number of node types are defined,
the search efficiency is raised by decreasing the number of examined node
combinations EEl. However, this approach may cause the search to find only a
local optimum solution and fail to reach the global optimum solution. In the
modified GP using the proposed evaluation value, the number of examined
combinations is effectively increased, giving the GP a better probability of
reaching the global optimum solution.

4 Results 

To verify the validity of the proposed method, we developed a rule generation 
system from a medical database. We compared the result of the proposed method 
with normal GP, using a database for the occurrence of hypertension. The pro- 
posed method was shown to give a more accurate rule set compared to normal 
GP. The proposed method also showed a greater improvement in fitness towards 
the end of the training cycle compared to normal GP. However, the number of
generations until termination increased for the proposed method.

5 Conclusions 

The proposed method allows GP to strongly maintain variety within the GP
group. Therefore, when a large number of node types are defined, the
proposed GP increases the probability of reaching the global optimum solution.
1 Introduction

Recently, many papers have been published on information visualization. The
systems they describe, however, each propose a specific way of viewing a large
set of data using 3D graphics. Our research group defines information
visualization not only as the 3D visual presentation of a large set of data or
records, but also as an architectural design of an interactive 3D space in
which we materialize a large set of data and records to interact with them. We
propose a generic visualization framework that materializes records as
interactive 3D objects. The componentware architecture of IntelligentBox,
developed by our group as a toolkit system for applications with interactive
3D graphics, provides the basis of this application framework. You can
associate the parameters of the composite object with some attributes of the
record.



2 Proposed Frameworks and System Overview 

IntelligentBox is a componentware system for developing 3D interactive
applications. It calls components boxes. Boxes may have arbitrary internal
functions as well as arbitrary 3D visual display functions. Unlike other 3D
application development toolkits, every 3D object in IntelligentBox has its
own function and reacts to user events. IntelligentBox provides a dynamic
functional composition mechanism that enables us to geometrically and
functionally combine 3D objects through direct manipulation. Only primitive
component boxes need to be programmed. Boxes in IntelligentBox can represent
not only computer animation graphics but also various application tools,
including information visualization systems and scientific visualization
systems, and allow us to integrate them. Each box is logically modeled as a
list of slots. A box can be connected to a single slot of no more than one
other box. The former becomes a child box of the latter, while the latter is
called a parent box of the former. The child can access the connected slot of
its parent by either a set message or a gimme message, while the parent box can
send an update message to its child boxes. In their default definitions, a set
message writes its parameter value into the corresponding slot register in the
parent box, while a gimme message reads the value of this slot register. An
update message tells the recipient that a state change has occurred in the
parent box. The developer may overload the definition of these procedures.
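
The slot connection and the three messages can be modeled in a few lines of Python. The sketch below is not the IntelligentBox API: the class `Box`, its method names, and the choice to send an update to every child right after a slot is written are assumptions made only to illustrate the default behavior described above.

```python
class Box:
    """Toy model of a box: a list of slots plus one optional parent connection."""

    def __init__(self, **slots):
        self.slots = dict(slots)   # slot name -> slot register value
        self.parent = None         # at most one parent box
        self.parent_slot = None    # the single slot this box is connected to
        self.children = []
        self.updates_seen = 0

    def connect(self, parent, slot):
        """Connect this box, as a child, to one slot of one parent box."""
        if self.parent is not None:
            raise ValueError("a box connects to a slot of at most one other box")
        if slot not in parent.slots:
            raise KeyError(slot)
        self.parent, self.parent_slot = parent, slot
        parent.children.append(self)

    def set(self, value):
        """set message: write a value into the connected slot of the parent."""
        self.parent.receive_set(self.parent_slot, value)

    def gimme(self):
        """gimme message: read the connected slot register of the parent."""
        return self.parent.slots[self.parent_slot]

    def receive_set(self, slot, value):
        self.slots[slot] = value
        # Assumption: the parent notifies all children after a state change.
        for child in self.children:
            child.update()

    def update(self):
        """update message: a state change has occurred in the parent box."""
        self.updates_seen += 1


parent = Box(height=1.0)
slider, label = Box(), Box()
slider.connect(parent, "height")
label.connect(parent, "height")
slider.set(2.5)
```

After `slider.set(2.5)`, `label.gimme()` reads the new value from the shared parent slot, which is how two child boxes stay consistent in this toy model.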

The retrieval of records from a database is performed by a DBProxyBox
working as a proxy of a database. Its slot list includes #query, #search,
#insert, #delete, #result, #previousCandidate, #nextCandidate, and
#currentRecord. When the #search slot is accessed by a set message, the
DBProxyBox issues the query held by the #query slot to the database. The
retrieved result is stored in the #result slot. The DBProxyBox has a cursor
that points to one of the records in the #result slot. The record pointed to is
held by the #currentRecord slot. The two other slots, #previousCandidate and
#nextCandidate, when accessed by a set message, move the cursor back and forth.
You may easily connect various boxes to this DBProxyBox, either to the
#currentRecord slot to define a 3D visual representation of each retrieved
record, or to the #result slot to visually present the 3D distribution of
retrieved records (Fig. 1).





Fig. 1. Interactive animation of records in Databases. 



In this example, we use a 3D doll as a 3D representation template of each
record. The height of each doll and the name label are associated with two
attributes of the records. You may even model this doll to stamp and to change
its pace, and associate this stamping speed with another record attribute.
Since each 3D record representation is a composite box, it can also be modeled
to respond to user events. This allows us to easily develop interactive
information visualization systems.

3 Concluding Remarks 

Unlike other information visualization systems, our system provides
a generic framework for developing various types of information
visualization. Its visual object materializes a record as a directly
manipulable object in an interactive VR environment. Users can make copies of
it for reuse in a different VR environment.
The need to deal with human motion data on computers is growing in the
fields of movies, video games, animation and so on. It is very useful to store
the motion data in a database, since the data can be reused in these areas.
On the other hand, many researchers have studied the recognition of human
motion from 2-D images. Most of them have focused on ways of extracting motion
information from 2-D images, and few of them have succeeded in recognizing
motion in general conditions. We deal with human motion data from a motion
capture system, which offers 3-D time series data. Each 3-D time series
represents motion information.

If the indices for motion data are given manually, they may vary according
to each person's point of view. Thus, we propose a method to extract primitive
motions for developing a motion database system which allows users to perform
content-based search. The extracted primitive motions are used to index motion
data automatically so that the given index represents a part of the motion,
that is to say, its contents.

To achieve content-based search, data similarity is evaluated by a pattern
matching approach. Many algorithms for pattern matching have already been
proposed, but most of them deal with patterns as vectors, where all data must
have the same time scale or length. However, even when two motion data are
performed by the same person in the same manner, their speed, course, or
both may differ. Thus, we use the Dynamic Time Warping (DTW)0
method to solve this problem. This method can reduce the effect of the
difference in speed between two data.
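
To make the role of DTW concrete, here is a minimal textbook version of the DTW distance in Python. It is a sketch, not the system's code: the function name `dtw_distance`, the use of Euclidean distance between 3-D points, and the unconstrained warping path are all assumptions for this example.

```python
import math


def dtw_distance(a, b):
    """Dynamic Time Warping distance between two sequences of points.

    a, b -- non-empty sequences of equal-dimension coordinate tuples.
    """
    n, m = len(a), len(b)
    # cost[i][j] is the best alignment cost of a[:i] and b[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])
            # A sample may be matched to several samples of the other sequence.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


fast = [(0, 0, 0), (1, 0, 0), (2, 0, 0)]
slow = [(0, 0, 0), (0, 0, 0), (1, 0, 0), (1, 0, 0), (2, 0, 0)]
other = [(0, 0, 0), (1, 0, 0), (3, 0, 0)]
```

The sequences `fast` and `slow` trace the same course at different speeds, so their DTW distance is zero even though their lengths differ, whereas `fast` and `other` differ in course and keep a positive distance.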

Furthermore, human motion does not have pre-defined patterns, whereas
human voice can be segmented into a set of phonemes. Thus, our approach is
as follows. First of all, our system divides all the motion data into segmental
motions by detecting changes of motion speed [2]. Next, it classifies similar
segmental motion data into the same cluster by using the Nearest Neighbor
algorithm with DTW as the distance function. Each cluster represents a typical
motion, which we call a primitive motion.

By using primitive motions, we can distinguish meaningful motion data from
noise motion data. The number of primitive motions is finite, and combinations
of primitive motions can represent all the motions. If each primitive motion is
indexed with a unique symbol, motion data can be represented as a sequence
of indices. Thus, comparing two motion data reduces to comparing symbols
if a symbol is given to each cluster in advance.
Fig. 1 shows an example of motion data whose segments are associated with
the symbols. Each letter represents a unique primitive motion. When a motion
is given as input, our system analyzes it and converts it into a symbol
sequence in terms of primitive motions. This conversion is executed
independently for all body segments (such as hand, elbow, and knee; in our
experiment we attached 16 markers to the human body). Experimental results show
that 330 segmental motions are classified into 94 primitive motions. Each
primitive motion takes less than two seconds.




Fig. 1. An example of motion data. 

In addition, we propose a method to recognize human motion with symbol
sequences. After automatic indexing, we obtain a table of symbol sequences as
shown in Fig. 2, where $M_{input}$ represents an input motion and
$M_1, \ldots, M_g$ represent the template motions. Then we compare these tables
to recognize the motion. This table is very helpful, because some motions can
be recognized by analyzing body segments separately. For example, if a motion
of one genre is mainly characterized by the arm's action (like clapping hands,
waving hands and so on), it can be recognized by focusing only on the arms.
This means that motion data can be expressed by a few symbols, and if
necessary, a complex motion, such as clapping hands while walking, can be
easily represented by combining arm and leg symbols. Furthermore, the space
required to store the tables is significantly smaller than that for 3-D time
series data.




Fig. 2. Overview of motion recognition. 
Introduction When we search a large text collection for documents we want, we
specify some keywords and obtain documents containing the keywords.
Because the search result contains many documents, it is important to rank
them. Although some ranking methods have been proposed, they do not consider
the positions of the keywords. As a result, the retrieved documents may
sometimes be useless because the keywords appear in the same document by chance
and each keyword has no relation to what we want to find. On the other hand, we
can consider documents in which all keywords appear close to each other as
meaningful ones. It is therefore effective to rank documents according to the
proximity of keywords in the documents. This ranking is regarded as a kind of
text data mining. In this paper, we show an algorithm for finding regions of
documents in which all given keywords appear in neighboring places. This
algorithm is based on the plane-sweep algorithm. It runs in $O(n \log n)$ time,
where $n$ is the number of occurrences of the given keywords, and its running
time does not depend on the number of given keywords. We also show
demonstrations of the algorithm using Web texts.

Proximity search To solve this problem, we can use a strategy of finding
documents in which all keywords appear close to each other. This is called
proximity search. Although some algorithms have been proposed for the problem,
they find regions of documents which contain all specified keywords and whose
size is less than a constant. Moreover, the result of the query contains
meaningless regions, for example a region which contains another region
containing all keywords.

Baeza-Yates et al. [2] proposed an algorithm for finding pairs of two keywords
whose distance is less than a given constant $d$ in $O((m_1 + m_2) \log m_1)$
time, where $m_1 < m_2$ are the numbers of occurrences of the keywords.

Manber and Baeza-Yates pj proposed an algorithm for finding the number
of pairs of two keywords whose distance is less than $d$ in $O(\log n)$ time
for $n$ occurrences of keywords. However, this algorithm uses $O(dn)$ space. It
is not practical for large $d$, and moreover it cannot be used for unspecified
values of $d$.

Though Aref et al. proposed an algorithm for finding tuples of $k$ keywords
in which all keywords are within distance $d$, it requires 0{n^) time.
They suggested an algorithm using the plane-sweep algorithm of Preparata and
Shamos ^ at the end of their paper, but no details were given.

The above three algorithms assume that the maximum distance $d$ of keywords is
a fixed constant, and they do not consider the minimality of answers defined in
the next subsection.
Our results In this paper, we propose $k$-word proximity search for ranking
documents. It is an extension of the method to find regions of documents
containing all specified keywords and is based on the idea of considering
documents in which all $k$ keywords appear close to each other as useful ones.
Such regions are regarded as summaries of documents; that is, the proximity
search can be regarded as a kind of text data mining. Our algorithm finds
intervals in documents which contain all specified keywords in ascending order
of their size. The time complexity of the algorithm depends neither on the
maximum distance of keywords nor on the number of keywords $k$. As far as the
authors know, no such algorithm exists for $k > 2$ keywords.

We introduce the concept of minimality of intervals. An interval is called
minimal if it does not contain other intervals which have all keywords. By
ignoring non-minimal intervals we can reduce the number of answers of a query
to less than $n$, the number of occurrences of the specified keywords in the
documents. We propose an algorithm for finding minimal intervals containing all
given keywords in order of their size. It is based on the plane-sweep algorithm
and runs in $O(n \log n)$ time.
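
The notion of minimal intervals can be illustrated with a short Python sketch. It is not the plane-sweep implementation of the paper: it uses a left-to-right sliding-window scan over the sorted occurrences followed by a sort on interval size, which also stays within $O(n \log n)$ time. The input format, a list of `(position, keyword_id)` pairs with distinct positions and keyword ids `0..k-1`, and the name `minimal_intervals` are assumptions for this example.

```python
from collections import Counter


def minimal_intervals(occurrences, k):
    """Return all minimal intervals containing all k keywords, smallest first.

    occurrences -- iterable of (position, keyword_id) pairs, distinct positions
    k           -- number of distinct keywords that must all appear
    """
    occ = sorted(occurrences)
    counts = Counter()   # keyword_id -> occurrences inside the current window
    covered = 0          # number of keywords present in the current window
    left = 0
    found = []
    for pos_r, kw_r in occ:
        counts[kw_r] += 1
        if counts[kw_r] == 1:
            covered += 1
        # Shrink from the left while the window still holds every keyword.
        while covered == k:
            pos_l, kw_l = occ[left]
            left += 1
            counts[kw_l] -= 1
            if counts[kw_l] == 0:
                # The leftmost occurrence was needed, so [pos_l, pos_r] is minimal.
                covered -= 1
                found.append((pos_l, pos_r))
    return sorted(found, key=lambda interval: interval[1] - interval[0])


occurrences = [(0, 0), (3, 1), (4, 0), (10, 1)]
intervals = minimal_intervals(occurrences, k=2)
```

For the four occurrences above, the scan reports three minimal intervals, fewer than the number of occurrences, and returns them in ascending order of size.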

Demonstration of our algorithm We implemented the algorithm and experimented
on html files. Our system can be accessed from

http://tiger.is.s.u-tokyo.ac.jp/odin/. The number of files is 51783 and
their total size is 185M bytes. We use inverted indexes for finding positions
of keywords. We store not the IDs of documents containing the keywords but the
exact positions of the keywords. We use a SUN Ultra60 workstation (CPU
UltraSPARC-II 360MHz) with 2GB memory. Searching time is less than 1 second
even if the total number of occurrences of keywords is about two million.
Case-based reasoning (CBR) is a well-known framework for coping with
ill-structured tasks, where not enough domain knowledge is available [^. The
main objective of CBR is to address the knowledge acquisition bottleneck.
Namely, in CBR the reasoner makes no effort to build an abstract model of the
domain knowledge to solve the problem; instead, during problem solving, it
relies on similar past cases and attempts to find an appropriate solution for
the problem at hand by modifying similar past solutions. However, CBR systems
also require substantial knowledge acquisition effort (e.g., acquiring cases,
case vocabulary, retrieval knowledge, and adaptation knowledge). This knowledge
is traditionally derived from a domain expert. Accordingly, although the
expert cannot propose an abstract model to support the domain, s/he attempts to
define some regularities in the domain that make it possible to reason with the
cases. In fact, the knowledge necessary to solve the problem is contained in
the cases. The expert defines the skill of using the right case in the right
place (similarity), and also defines a formalism to modify the old cases so
that they address the new problem (adaptation). If precision matters, defining
this knowledge needs an accurate model of the domain proposed by the expert
(which is normally not available in CBR domains); if precision is not assumed
as a requirement, the system may fail to demonstrate acceptable reasoning
skill. Moreover, since in CBR there is no well-defined background knowledge,
having the expert define this knowledge constrains the system to a static,
incomplete reasoning skill that does not evolve with new experiences, even
though this skill can be acquired through the new cases. Furthermore,
adaptation is a compensatory part of CBR [^1. The adaptation process can
compensate for shortcomings in other steps of the reasoning process by
appropriately modifying a weak match to obtain a reasonable solution.
Accordingly, in this work we propose automating the process of acquiring
adaptation knowledge as a means of increasing the reasoning skill.

While other approaches for extracting adaptation knowledge merely attempt
to find a way of modification [0, we define reasoning skill as distinct from
problem solving skill. We attempt to extract a base for assessing similarity,
and therefore to extract similarity from the adaptation rules. In brief, we use
rule extraction as a means of increasing reasoning skill. The main objective
therefore centers on dynamic aspects. We aim to define a formalism that elicits
the adaptation knowledge autonomously and independently of the expert.
Therefore, as the system gathers more cases, not only are solutions for more
cases obtained, but the reasoning skill with these cases also becomes more
mature. Thus the problem solving skill increases comprehensively.

Our objective mainly concerns two situations: one in which a general rule
about adaptation knowledge can be extracted in a known framework, and one in
which there is no such condition and the reasoner attempts to use opportunities
to derive adaptation knowledge. While the former can be enriched by
constraining the reasoner to extract the knowledge in the framework of
substitutional and transformational strategies, the latter applies in a
structural framework. Substitutional strategies generally refer to methods
that modify the solution of a case by substituting a part of it.
Transformational strategies result in deleting a part of the solution or
adding a part to it. Structural adaptation, however, is when the reasoner
applies some modification to the structure of a case (derivational analogy
belongs to this category 0).

Learning in the light of a predefined framework occurs off-line at compile
time. However, using opportunities to learn a new adaptation happens at run
time, that is, at real problem solving time.

To implement this idea, we classify the cases into a number of classes based
on the similarity of their solutions. We then find the adaptation rules for
each of the derived classes. At reasoning time we predict the class of the new
problem by a Bayesian approach, and then apply the rules that hold for that
specific class. Classifying the cases is based on the idea that, if a general
rule that is globally true within the whole case base cannot be extracted
because of the complexity of the domain, it may still be possible to extract
local rules that are partially true and can support a part of the case base,
which we refer to as a class. Namely, instead of extracting a global rule, we
attempt to define multiple local rules, each of which covers a part of our
task. Moreover, as more cases are gathered, these local rules can improve and
become more reliable and certain.
1 Introduction

Most speech recognition systems are phoneme-based; however, the acoustical
characteristics of phonemes are severely influenced by various factors such as
phoneme contexts. In order to avoid such influences, several studies have
sought suitable processing units for speech recognition. These
methods, however, need to determine the number of units and the length of each
unit in advance.

In this paper, we propose an automatic acquisition method for speech
recognition units based on the Hidden Markov Network (HMnet). The new speech
recognition unit, named an “acoustic segment”, is a sequence of several
phonemes. Word and sentence models for recognition can be constructed by
concatenating the models, just as with phoneme-based models.

2 Outline of the Construction Algorithm 

The algorithm |2| consists of four steps. 

1. Training of initial model 

The initial model, which is a left-to-right type HMM with two states, is 
trained using all sentence samples. The model permits transitions from the 
final state to the first state only at phoneme boundaries of training samples. 

2. Acquisition of acoustic segments 

First, each frame in a sample is assigned to a state of the initial model
using the Viterbi algorithm. An acoustic segment is then determined as a
phoneme sequence that starts in the first state and ends in the final
state.

3. State splitting and re-estimation of the model 

The SSS-free algorithm, one of the algorithms for constructing an HMnet,
is applied to the model together with the acoustic segments obtained in the
previous step. In this step the model does not permit transitions from the
final state to the first state.
Table 1. Frequency of acoustic segments in training samples

/a/  7.3% | N     5.5% | /a r/  1.5% | /N/     1.0% | /i m/  0.9% | /o m/  0.7% | /e j/    0.1%
/k/  7.2% | /o/   5.3% | /h/    1.4% | /a. i/  0.9% | /a n/  0.8% | /m/    0.6% |
/i/  6.6% |            | /Q t/  1.2% | /u:/    0.9% |             | /w a/  0.6% | /e N m/  0.1%
/e/  5.8% | /c/   2.3% | /o:/   1.1% | /o r/   0.9% | /o n/  0.7% |             |


Table 2. Distribution of length in obtained acoustic segments

length     |  1     |  2     |  3    |  4    |  5
frequency  | 64.1%  | 31.1%  | 4.2%  | 0.5%  | 0.03%



4. Reconstruction of acoustic segments 

Acoustic segments are reconstructed in the same way as in Step 2.

Steps 3 and 4 are repeated until the number of states reaches the pre-defined
number.
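
The acquisition in Step 2 can be illustrated with a small sketch. It assumes
that the Viterbi alignment is already available as a list of per-phoneme state
sequences (the name `aligned`, the two state labels 0 and 1, and the function
name are illustrative, not taken from the original system). A segment is closed
at a phoneme boundary when the phoneme ends in the final state and the next
phoneme restarts in the first state.

```python
def acoustic_segments(aligned, first_state=0, final_state=1):
    """Group phonemes into acoustic segments from a frame-to-state alignment.

    aligned: list of (phoneme, states) pairs, where states is the list of
             model states assigned to that phoneme's frames (hypothetical
             input format, assumed to come from a Viterbi alignment).
    """
    segments, current = [], []
    for i, (phoneme, states) in enumerate(aligned):
        current.append(phoneme)
        is_last = i == len(aligned) - 1
        # A return to the first state is only allowed at a phoneme boundary.
        restarts = (not is_last) and aligned[i + 1][1][0] == first_state
        if states[-1] == final_state and (is_last or restarts):
            segments.append(tuple(current))
            current = []
    if current:
        # Trailing phonemes that never reached the final state.
        segments.append(tuple(current))
    return segments


# Toy alignment: /a r/ and /o m/ become two-phoneme segments, /k/ stands alone.
example = [("a", [0, 0, 1]), ("r", [1, 1]), ("k", [0, 1]),
           ("o", [0, 0]), ("m", [0, 1, 1])]
segments = acoustic_segments(example)
```

Counting the resulting tuples over all training sentences would give
frequency tables of the kind shown in Tables 1 and 2.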

3 Experiments 

We carried out model-construction experiments. One hundred sentences uttered by
a male speaker were used as training samples. The number of states was set to
500.

In total, 310 distinct acoustic segments were obtained. Table 1 shows examples
of the acoustic segments. Most of the two-phoneme acoustic segments followed a
vowel-consonant pattern such as /a r/. Table 2 shows the distribution of the
lengths of the acoustic segments, that is, the number of phonemes in an acoustic
segment. Most acoustic segments consist of one or two phonemes.

4 Conclusion 

We have proposed a method to acquire speech recognition units automatically.
In the experiments, the two-phoneme acoustic segments obtained were mostly of
the vowel-consonant type.
Introduction. The automatic construction of classifiers is an important
research problem in data mining, since a classifier provides not only good
predictions but also a characterization of the given data in a form easily
understood by a human. A decision tree is a classifier widely used in real
applications: it is easy to understand and can be constructed efficiently by a
method based on entropy heuristics. Fukuda et al. have proposed an efficient
algorithm (called DT in this abstract) for constructing a small and accurate
decision tree with numeric attributes, using optimized two-dimensional numeric
association rules as node labels.

A problem is that, at each node, DT generates many rules for the possible pairs
of numeric and ordered attributes but selects only one optimized rule among
them. Since this generation is time-consuming, the construction may be
inefficient when there are many numeric and ordered attributes. One possible
approach is to build a one-level decision tree such as 1R [3]. We take another
approach: we aggregate the decisions made by all the generated rules.



Weighted Aggregation Classifiers. In this abstract, we introduce weighted
aggregation classifiers, which can be constructed as efficiently as one-level
decision trees but can provide highly accurate classification. Suppose that we
have the set of all rules $r_i$, $i = 1, \ldots, k$, generated from a dataset,
where each rule is associated with parameters $c_i^1$ ($c_i^0$) denoting the
conditional probability that the target attribute is true given that $r_i$ is
true (false) on an instance $x$. Then, a weighted aggregation classifier (WA) is
a collection $H = \{(r_i, c_i^1, c_i^0, w_i) \mid i = 1, \ldots, k\}$ of
quadruples, where we associate with each rule $r_i$ in $H$ a real weight $w_i$
representing its classification accuracy, so that an accurate rule has a large
weight. The decision $H(x)$ is made by the majority vote over the classifiers:

$$H(x) = \Bigl[\, \sum_i w_i \cdot \mathit{conf}_i(x) > \sum_i w_i \cdot (1 - \mathit{conf}_i(x)) \,\Bigr],$$
Table 1. The results of prediction on the UCI repository. Acc and Time are in
% and seconds, respectively.

Dataset         | Size | Acc_base | WA Acc | WA Time | DT Acc | DT Time
Breast Cancer   |  699 |  65.52   | 97.51  |   237   | 95.61  |   666
Liver Disorder  |  345 |  57.97   | 67.83  |    51   | 50.46  |   202
Pima Diabetes   |  769 |  65.10   | 69.40  |   234   | 69.92  |  1216
Balance Scale   |  625 |  53.92   | 85.59  |    31   | 79.36  |   106
Titanic         | 2201 |  67.70   | 77.60  |   0.5   | 79.05  |     2

Table 2. The results of the High SBP prediction problem. Accuracies are measured
in %.

Attributes  | Acc_base | WA Acc_avr | WA Acc_max | DT Acc_avr | DT Acc_max
BMI         |  60.00   |   59.51    |   63.42    |   60.11    |   63.93
BMI+Others  |  60.00   |   70.71    |   74.13    |   62.92    |   70.09



where $\mathit{conf}_i(x)$ is $c_i^1$ if $x$ satisfies $r_i$ and $c_i^0$
otherwise, and $[P]$ is the characteristic function of a predicate $P$. In the
experiments, we set $w_i = \max_j \mathrm{Ent}(r_j) - \mathrm{Ent}(r_i)$, where
$\mathrm{Ent}(r_i)$ is the entropy of $r_i$ on a dataset $S$. If there is only
one rule $r_i$, then we set $w_i = 1$. Since WA makes a decision by considering
more than one rule, WA should perform better than 1R [3] when several attributes
interact.
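
The decision rule and the entropy-based weights can be sketched as follows.
This is a minimal illustration, assuming each rule is given as a predicate
together with its two conditional probabilities and a precomputed entropy; the
`Rule` class, its field names, and the toy attribute names are hypothetical and
are not part of the original implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    holds: Callable[[Dict], bool]  # does the rule r_i hold on instance x?
    c1: float                      # P(target is true | r_i is true)
    c0: float                      # P(target is true | r_i is false)
    entropy: float                 # Ent(r_i) on the training set (assumed given)


def weights(rules: List[Rule]) -> List[float]:
    """w_i = max_j Ent(r_j) - Ent(r_i); a single rule gets weight 1."""
    if len(rules) == 1:
        return [1.0]
    worst = max(r.entropy for r in rules)
    return [worst - r.entropy for r in rules]


def wa_decide(rules: List[Rule], x: Dict) -> bool:
    """Weighted vote: true iff sum w_i*conf_i(x) > sum w_i*(1 - conf_i(x))."""
    ws = weights(rules)
    confs = [r.c1 if r.holds(x) else r.c0 for r in rules]
    votes_for = sum(w * c for w, c in zip(ws, confs))
    votes_against = sum(w * (1.0 - c) for w, c in zip(ws, confs))
    return votes_for > votes_against


# Toy rules over hypothetical attributes.
toy_rules = [
    Rule(lambda x: x["bmi"] > 25, c1=0.8, c0=0.3, entropy=0.6),
    Rule(lambda x: x["smoker"], c1=0.7, c0=0.4, entropy=0.9),
    Rule(lambda x: x["sleep"] < 6, c1=0.6, c0=0.5, entropy=0.8),
]
decision = wa_decide(toy_rules, {"bmi": 30, "smoker": False, "sleep": 7})
```

Note that the rule with the largest entropy receives weight zero under this
weighting, so it does not influence the vote.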



Experiments. We implemented WA and DT and compared them experimentally as
follows. First, we ran experiments on five benchmark datasets from the UCI
repository. The accuracies were evaluated by two-fold cross validation, and the
timings were taken on Solaris 2.6 on an UltraSPARC IIi at 300 MHz. In Table 1 we
observe that WA produced more accurate classifiers in a shorter time than DT.

Secondly, we ran experiments on a large real dataset consisting of health
condition records of around 300,000 Japanese national university students,
obtained by a nationwide health science survey conducted in 1995 by the CSSH.
The task here is to predict an abnormality in Systolic Blood Pressure (SBP),
called High SBP (SBP > 140 mmHg), from the Body Mass Index (BMI) and 13 ordered
attributes describing the students' life-style, which is a prevalent analysis in
health science research. The training and the test sets consist of 75 and 3653
randomly chosen records, respectively. The baseline accuracy is Acc_base = 60%,
and the average Acc_avr and the maximum Acc_max are taken over 100 trials.

In Table 2 we show the results of the prediction by BMI alone (the upper row)
and the prediction by BMI and all life-style attributes (the lower row). With
BMI alone, both algorithms produce only trivial classifiers. With BMI and the
life-style attributes, we observe that on average WA produces nontrivial
classifiers with accuracy 70.71%, while DT produces classifiers with accuracy
only 62.92%, which is only slightly above the baseline.
Recent development of spectroscopic instruments has allowed us to obtain a large
amount of spectral data in machine-readable form. High-resolution molecular
spectra contain abundant information on the structures and dynamics of
molecules. However, extracting such information requires a procedure of spectral
assignment, in which each spectral line is assigned a set of quantum numbers.
This procedure has traditionally been performed by making use of regular
patterns that are readily seen in the observed spectrum.

However, we often encounter complex spectra in which such regular patterns
may not be readily discerned. The purpose of the present work is to search for
new methods that can assist in assigning such complex molecular spectra. We
wish to devise computer-aided techniques for picking out regular patterns buried
in a list of observed values that appear to be randomly distributed. We hope to
draw on results from the information sciences and on the great computational
power of modern computers.

Previously we proposed a method, which we tentatively refer to as the
"second difference method," and suggested that this technique may be developed
into a useful method for the analysis of complex spectra. Let $f_1$, $f_2$, and
$f_3$ be the frequencies of three spectral lines arbitrarily chosen from the
observed spectrum, and calculate $\Delta = f_2 - f_1$ and
$\Delta^2 = f_1 - 2f_2 + f_3$. It is usually a good approximation to represent
the frequencies of spectral lines belonging to a series by a quadratic function
of a running number. Then, the $\Delta^2$ values calculated from various sets of
three consecutive lines in the same series would be almost constant, although
the $\Delta$ values are scattered over a certain range.
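
The reason why the second difference is constant can be made explicit with a
short derivation. The symbols $a$, $b$, $c$ and the running number $n$ are
introduced here for illustration, assuming the series is exactly quadratic.

```latex
% Assume f_n = a + b n + c n^2 for consecutive lines n, n+1, n+2.
\begin{align*}
\Delta_n   &= f_{n+1} - f_n = b + c\,(2n + 1), \\
\Delta^2_n &= f_n - 2 f_{n+1} + f_{n+2} \\
           &= c\,\bigl[n^2 - 2(n+1)^2 + (n+2)^2\bigr] = 2c .
\end{align*}
```

Thus $\Delta$ varies linearly with $n$ along a series, whereas $\Delta^2$
depends only on the quadratic coefficient.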

We believe that the following three-step strategy will be effective. 

Step 1 We devise an algorithm for picking out, from the list of the
observed frequencies, sets of three frequencies characterized by
similar values of $\Delta^2$. Each set picked out is referred to as
a 3-membered chain.

Step 2 We devise an algorithm for connecting 3-membered chains to
generate longer chains, which are considered as candidates for
spectral series.

Step 3 We devise an algorithm for assessing and classifying the
candidates in terms of plausibility.

We have previously made a cursory test of the present method on the line
frequencies in the observed spectrum of the linear molecule DCCCl. The present
poster presents a further test using a simulated spectrum. For this purpose,
we synthesized a spectrum roughly corresponding to that of the linear molecule
HCCBr around 2100 cm$^{-1}$ observed with a resolution of 0.003 cm$^{-1}$.
Twenty component bands were superposed with various weights. The component bands
consisted of the fundamental and hot bands of the HCC$^{79}$Br and HCC$^{81}$Br
isotopomers. We also wrote a program for reading peak frequencies from the
simulated spectrum. The peak frequencies thus read out were used for the
following test.

Step 1

We calculated $\Delta$ and $\Delta^2$ from a set of three frequencies
arbitrarily chosen from the list of frequencies, and plotted a point on a chart
with $\Delta$ as the vertical axis and $\Delta^2$ as the horizontal axis. This
procedure was repeated for every possible set of three frequencies. On the
resulting plot we observed a concentration of points in the region centered at
$\Delta = 0.3$ cm$^{-1}$ and $\Delta^2 = -0.001$ cm$^{-1}$, and a random and
thinner distribution elsewhere. Thus, the sets of three frequencies
corresponding to the points within the region
$0.0$ cm$^{-1} < \Delta < 0.6$ cm$^{-1}$ and
$-0.0025$ cm$^{-1} < \Delta^2 < 0.0005$ cm$^{-1}$ were selected as 3-membered
chains.
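
The selection in Step 1 can be sketched as below. This is an illustrative
brute-force version, not the authors' program: it assumes the peak list fits in
memory, considers only ascending triples (sufficient here because the selected
window requires a positive first difference), and uses the window limits quoted
above as default parameters. The function name is hypothetical.

```python
from itertools import combinations


def three_membered_chains(freqs, d_range=(0.0, 0.6), d2_range=(-0.0025, 0.0005)):
    """Return all ascending triples whose first and second differences
    fall inside the given windows (all values in cm^-1)."""
    chains = []
    # O(n^3) enumeration of every triple f1 < f2 < f3.
    for f1, f2, f3 in combinations(sorted(freqs), 3):
        d = f2 - f1              # first difference
        d2 = f1 - 2.0 * f2 + f3  # second difference
        if d_range[0] < d < d_range[1] and d2_range[0] < d2 < d2_range[1]:
            chains.append((f1, f2, f3))
    return chains


# Synthetic series f(n) = 2100 + 0.3 n - 0.0005 n^2 plus two unrelated lines.
series = [2100.0 + 0.3 * n - 0.0005 * n * n for n in range(5)]
found = three_membered_chains(series + [2098.0, 2105.0])
```

For the synthetic list, only the three triples of consecutive series members
pass the window, since every other triple has a second difference far outside
it.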

Step 2

We connected the 3-membered chains to generate longer chains in the following
way. If we have a 3-membered chain $(f_1, f_2, f_3)$ and another 3-membered
chain $(f'_1, f'_2, f'_3)$ where $f_2 = f'_1$ and $f_3 = f'_2$, we generate a
4-membered chain $(f_1, f_2, f_3, f'_3)$. Then, if we have a 4-membered chain
$(f_1, f_2, f_3, f_4)$ and a 3-membered chain $(f'_1, f'_2, f'_3)$ where
$f_3 = f'_1$ and $f_4 = f'_2$, we generate a 5-membered chain
$(f_1, f_2, f_3, f_4, f'_3)$. By repeating the same procedure, we obtain
gradually longer chains.
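
The connection procedure of Step 2 amounts to repeatedly appending the third
member of any 3-membered chain whose first two members match the last two
members of an existing chain. A minimal sketch follows; the function names are
illustrative, and it assumes that frequencies increase strictly within each
3-membered chain, which guarantees that the loop terminates.

```python
def extend_chains(chains, triples):
    """Extend every n-membered chain by one member using the 3-membered chains."""
    next_by_pair = {}
    for f1, f2, f3 in triples:
        next_by_pair.setdefault((f1, f2), []).append(f3)
    return [chain + (f,)
            for chain in chains
            for f in next_by_pair.get(chain[-2:], [])]


def grow_chains(triples):
    """Return {length: list of chains} for every length that can be reached."""
    triples = [tuple(t) for t in triples]
    levels = {}
    current, length = triples, 3
    while current:
        levels[length] = current
        current = extend_chains(current, triples)
        length += 1
    return levels


# Toy input with integer labels standing in for frequencies.
levels = grow_chains([(1, 2, 3), (2, 3, 4), (3, 4, 5), (2, 3, 7)])
```

Indexing the 3-membered chains by their first two members keeps each extension
pass linear in the number of chains, rather than comparing every pair.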

A typical result of the application to the simulated spectrum of HCCBr is as
follows. Starting from 11947 3-membered chains, we obtained 2319 4-membered
chains, 625 5-membered chains, 264 6-membered chains, 135 7-membered chains,
81 8-membered chains, 53 9-membered chains, 35 10-membered chains,
25 11-membered chains, 18 12-membered chains, 13 13-membered chains,
9 14-membered chains, 6 15-membered chains, 4 16-membered chains, and
2 17-membered chains. Longer chains were not obtained.

Step 3

How to assess the plausibility of an n-membered chain as a candidate for a
spectral series is an interesting but difficult problem. We first take up a
naive choice, i.e., fitting to a quadratic function. For an n-membered chain
$(f_1, f_2, f_3, \ldots, f_n)$, we consider
$V = \sum_{i=1}^{n} \{f_i - (a + bi + ci^2)\}^2$. The constants $a$, $b$, and
$c$ are optimized so as to minimize $V$. The minimum value $V_{min}$ is used for
the assessment. We calculated the $V_{min}$ values for the 264 6-membered
chains, which consisted of 232 true chains and 32 false chains. The resulting
values, ranging from 6.4x10“^° cm“^ to 1.3x10“® cm“^, are of use, although not
very reliable as the criterion.
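
The quadratic fit of Step 3 is an ordinary linear least-squares problem. The
sketch below solves the 3x3 normal equations directly in plain Python. As
illustrative choices that are not in the original, it centres the running
number and subtracts the first frequency before fitting; this changes the
fitted constants but not the minimum $V_{min}$, and it improves numerical
conditioning. The function name is hypothetical.

```python
def v_min(chain):
    """Minimum of V = sum_i (f_i - (a + b*i + c*i^2))^2 over a, b, c."""
    n = len(chain)
    xs = [i - (n + 1) / 2.0 for i in range(1, n + 1)]  # centred running number
    ys = [f - chain[0] for f in chain]                 # offset removed
    # Normal equations for the basis (1, x, x^2), as an augmented 3x4 matrix.
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    m = [[s[0], s[1], s[2], t[0]],
         [s[1], s[2], s[3], t[1]],
         [s[2], s[3], s[4], t[2]]]
    # Gaussian elimination with partial pivoting (needs at least 3 members).
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            factor = m[r][col] / m[col][col]
            m[r] = [u - factor * v for u, v in zip(m[r], m[col])]
    p = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        p[r] = (m[r][3] - sum(m[r][k] * p[k] for k in range(r + 1, 3))) / m[r][r]
    a, b, c = p
    return sum((y - (a + b * x + c * x * x)) ** 2 for x, y in zip(xs, ys))


# An exactly quadratic 6-membered chain and a copy with one displaced line.
true_chain = [2100.0 + 0.3 * i - 0.0005 * i * i for i in range(1, 7)]
false_chain = list(true_chain)
false_chain[2] += 0.01
scores = (v_min(true_chain), v_min(false_chain))
```

A chain that follows a quadratic closely yields a value near zero, while a
chain containing a displaced line yields a noticeably larger value.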
1 Introduction

A number of approaches to construct a knowledge base for machines have been 
proposed. The Cyc project manually constructed a knowledge base by using 
an artificial knowledge representation language. Such an approach, however, has 
not succeeded because of the heavy cost of construction and the difficulty in 
maintaining its consistency. 

On the other hand, there is growing interest in approaches that automatically
generate a knowledge base from large text corpora. Among the several types of
corpora, dictionaries are a promising resource for a knowledge base.

A dictionary consists of a set of definitional sentences about words or
concepts, usually written using sentential patterns. This paper proposes a
method for discovering these definition patterns. Once definition patterns are
discovered, they can be used to generate a knowledge base from a dictionary
automatically.

2 Discovery of Definition Patterns 

The structure of a Japanese sentence can be described well by the dependency
relations between bunsetsus. A bunsetsu is a basic unit in the Japanese
language, consisting of one or more content words followed by zero or more
function words. The dependency structure of a Japanese sentence can be
represented as a graph in which bunsetsus map to vertices and dependency
relations between bunsetsus map to edges, as shown in the accompanying figure.

In such a representation, any sub-sentential pattern can be regarded as a
subgraph. The definition patterns which we want to discover automatically are
fixed and important phrases or clauses. It is difficult to define exactly what
the definition patterns are; however, they meet at least the following three
criteria: 1) they probably occur frequently, 2) the bigger they are, the more
important they seem to be, and 3) they can include semantic classes instead of
real words.

Since these criteria involve trade-offs, we need an evaluation function to
balance them. We therefore employ the Minimum Description Length (MDL) principle
proposed by Rissanen. The MDL principle is a principle for both data
compression and statistical estimation.




S. Arikawa, K. Furukawa (Eds.): DS’99, LNAI 1721, pp. 364-EnSI 1999. 
© Springer- Verlag Berlin Heidelberg 1999 






In our task, we define the description length of a set of graphs, each of which 
represents a definition sentence. Then, we look for a subgraph which minimizes 
the description length when the occurrences of the subgraph are reduced into a 
new single vertex. The detected subgraph is supposed to be a good definition 
pattern. Our method iterates the above process until such a subgraph cannot be 
detected anymore. 

3 Algorithm 

In order to reduce the size of the search space, we place two restrictions.

- Only the top-n most frequent pairs of vertices are considered as candidate
subgraphs.

- Only the semantic classes that exist within the thesaurus in the form of a
cut of a tree are considered.
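
The first restriction can be illustrated with a short sketch that counts
connected vertex pairs over a set of dependency graphs and keeps the n most
frequent ones. The representation of a graph as a list of (dependent, head)
label pairs, the function name, and the toy bunsetsu labels are assumptions
made for illustration; the computation of the description length itself is not
shown.

```python
from collections import Counter


def top_pairs(graphs, n):
    """Return the n most frequent (dependent, head) label pairs.

    graphs: list of dependency graphs, each given as a list of
            (dependent_label, head_label) edges (hypothetical format).
    """
    counts = Counter()
    for edges in graphs:
        counts.update(edges)  # every occurrence of a pair is counted
    return [pair for pair, _ in counts.most_common(n)]


# Toy graphs loosely modelled on definitions of occupations.
toy_graphs = [
    [("shigoto-ni", "shite-iru"), ("shite-iru", "hito"), ("uta-o", "shigoto-ni")],
    [("shigoto-ni", "shite-iru"), ("shite-iru", "hito"), ("e-o", "shigoto-ni")],
    [("shigoto-ni", "shite-iru"), ("ryouri-o", "shigoto-ni")],
]
candidates = top_pairs(toy_graphs, 2)
```

Each candidate pair would then be scored by the change in description length
obtained when its occurrences are contracted into a single vertex.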



4 Experiment and Discussion 

In our experiment, Reikai Shogaku Kokugojiten, a Japanese dictionary for chil- 
dren, and Bunruigoihyou, a Japanese thesaurus, were used. 

From this dictionary, we discovered many definition patterns which satisfy the
three criteria mentioned in Sec. 2. For example, "Shuto-wa 'capital is' PLACE"
(PLACE denotes a semantic class) was discovered from the definition sentences
for America, Japan, and so on. "Shigoto-ni 'occupation' shite-iru 'doing' hito
'person'" was discovered from singer, painter, cook and others.

We are planning to investigate the detected definition patterns from a
linguistic viewpoint. We also have to improve our search algorithm.
From direct satellite observations of space plasma, we have obtained many
significant results by means of velocity moments of particle velocity
distribution functions (e.g., number density, bulk velocity, and temperature).
This fluid description assumes that the plasma is able to maintain local thermal
equilibrium. When a plasma is in a state of thermal equilibrium, the particle
velocity distribution function is a normal distribution, which is called a
Maxwellian in plasma physics.

In the meantime, observational techniques have progressed notably and have
made it possible to measure the detailed shape of the distribution function in
three-dimensional velocity space. These observations revealed that there are
many cases where space plasmas have not reached a state of thermal equilibrium
and their velocity distributions are not a single Maxwellian but consist of
multiple peaks. Such distributions may occur because space plasma, such as that
in the solar wind, is basically collisionless with a large mean free path
(about 1 AU). Distribution functions of different shapes may give the same
velocity moments. For instance, when a plasma has two beam components whose
velocity vectors are sunward and anti-sunward, respectively, and the numbers of
particles composing each component are the same, the calculated bulk velocity is
zero. On the other hand, when a stagnant plasma is observed, the bulk velocity
is also zero. In the two-beam case, it is necessary to separate the two beams
and calculate the velocity moments for each beam. It has been difficult,
however, to evaluate the shape of the distribution function, especially when two
or more components partially overlap each other. This poses a serious problem,
especially when we treat many multi-component cases statistically.
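
The ambiguity of velocity moments can be checked with a few lines of
arithmetic. The sketch below reduces the problem to one dimension and uses
made-up densities and speeds (none of these numbers come from the
observations); it only shows that two opposed beams of equal density and speed
and a stagnant plasma give the same bulk velocity.

```python
def bulk_velocity(components):
    """Density-weighted mean velocity of a list of (number_density, velocity)."""
    total_density = sum(n for n, _ in components)
    return sum(n * v for n, v in components) / total_density


# Illustrative values only: densities in arbitrary units, velocities in km/s.
two_beams = [(1.0, 400.0), (1.0, -400.0)]  # sunward and anti-sunward beams
stagnant = [(2.0, 0.0)]                    # a single population at rest

same_moment = bulk_velocity(two_beams) == bulk_velocity(stagnant)
```

Separating the components first, and then computing the moments of each one,
removes this ambiguity.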

In this paper, we construct a method for representing a three-dimensional
distribution function by a finite mixture distribution model whose parameter
values are obtained by the EM (Expectation-Maximization) algorithm. With this
method, we can express the shape of the function and find a possible way to
conduct a statistical analysis.
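
To make the idea concrete, the following sketch runs the EM algorithm for a
mixture of normal distributions in one dimension. It is a simplification of the
setting described above: the paper fits three-dimensional Maxwellians to
measured distribution functions, whereas this example fits one-dimensional
Gaussians to individual velocity samples. The function names, the fixed
iteration count, and the variance floor are assumptions made for illustration.

```python
import math


def normal_pdf(v, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((v - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_mixture_1d(samples, means, variances, weights, iterations=100):
    """Fit a Gaussian mixture to samples, starting from the given parameters."""
    means, variances, weights = list(means), list(variances), list(weights)
    n = len(samples)
    for _ in range(iterations):
        # E-step: responsibility of each component for each sample.
        resp = []
        for v in samples:
            p = [w * normal_pdf(v, m, s2)
                 for w, m, s2 in zip(weights, means, variances)]
            total = sum(p) or 1e-300  # guard against underflow to zero
            resp.append([pk / total for pk in p])
        # M-step: re-estimate mixture ratio, mean and variance per component.
        for k in range(len(means)):
            nk = sum(r[k] for r in resp)
            if nk == 0.0:
                continue
            weights[k] = nk / n
            means[k] = sum(r[k] * v for r, v in zip(resp, samples)) / nk
            var = sum(r[k] * (v - means[k]) ** 2 for r, v in zip(resp, samples)) / nk
            variances[k] = max(var, 1e-9)  # floor to avoid a collapsed component
    return weights, means, variances
```

In this simplified picture, the fitted mean of a component plays the role of
its bulk velocity, its variance is proportional to its temperature, and its
weight is its mixture ratio.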
Fig. 1 (left) shows real ion velocity distribution function data observed
by the Geotail spacecraft in the plasma sheet boundary layer of the Earth's
magnetosphere, in which particles and fields from both the tail lobe and the
central plasma sheet contribute. The slice of the three-dimensional distribution
data is displayed in the $v_x$-$v_y$ plane, where positive/negative $v_x$
corresponds to sunward/tailward and positive/negative $v_y$ corresponds to
duskward/dawnward. These data appear to be composed of a cold component in
$v_y < 0$, which may enter from the tail lobe, and a hot component in $v_y > 0$
from the central plasma sheet, so our aim is to extract both components from
the data.

Applying our method to this data with two-Maxwell mixture distribution 
model, we obtained the parameters shown in Table 1. In Fig. 1 (right) we dis- 
play the calculated distribution function based on those parameters in the same 
format as the real data. We found that the extraction of both components was 
successfully performed. 




Fig. 1. (left) An example of real three-dimensional velocity distribution
function data. (right) Calculated distribution function fitted to the real data
with a two-Maxwell mixture distribution. Both are shown as slices in the
$v_x$-$v_y$ plane.
      N       Vx      Vy      Vz     T
#1    0.008    -447    -603      4    694   904   -37   1194   -79   792
#2    0.024   -1119     764   -107   3774   159  -148   6452   534  6322

Table 1. Estimated parameters when we fit the two-Maxwell mixture distribution
model to the real data. The value of N is the mixture ratio multiplied by the
number density.
This paper concerns an application of the GUHA method, using fingerprint
descriptors, in the area of structure-activity relationships (SAR). The PC GUHA
software system has been used in computer-aided mutagen discovery. GUHA is an
acronym for General Unary Hypotheses Automaton. Whereas statistical packages
test hypotheses that the user has already formulated, GUHA is explorative in
character: it automatically generates hypotheses from empirical data.

PC GUHA was previously used for generating pharmacophoric hypotheses
using fingerprint descriptors. Chemical structures were encoded by unique
fingerprint descriptors in the same manner as fingerprints are encoded in
dactyloscopy. Quantitative Structure-Activity Relationship (QSAR) studies have
been performed using the GUHA method to generate pharmacophoric hypotheses on
the reasons for the therapeutic success or failure of drugs. The results are
widely applicable in drug discovery for tailoring new drugs.

Muggleton et al. used an alternative approach, the Inductive Logic Programming
(ILP) system Progol, for mutagen discovery on a previously published data
subset. This subset was already known not to be amenable to statistical
regression, though its complement was adequately explained by a linear model.
The PC GUHA software system was applied to the same data as well.

Nominal variables can be used as GUHA input. Consequently, any structure coding
can be used. Fingerprint descriptors of the types peak, roof, groove, and
channel were used for coding the chemical structures of 230 nitroaromatic
compounds that were screened for mutagenicity by the Ames test (with Salmonella
typhimurium).

The data present a significant challenge to studies of structure-activity
relationships (SAR), as they cover a heterogeneous series of chemicals without
a single common structural template. For example, the authors of an earlier
study constructed 4 basic attributes, including log P, the logarithm of the
octanol/water partition coefficient, which characterizes hydrophobicity, and the
energy of the LUMO (lowest unoccupied molecular orbital). As GUHA input, both
log P and the LUMO energy were used along with the fingerprint descriptors.

The data set was identical to that used in the earlier studies. Hypotheses of
the type "higher LUMO and no five-member hill imply higher mutagenicity" were
generated using PC GUHA. Each hypothesis is statistically evaluated using
several quantifiers, e.g., the ratio of the number of cases (here compounds)
fulfilling the hypothesis to their total, or Fisher statistics. Accordingly,
many favorable hypotheses have been generated by PC GUHA. Their chemical
interpretation is the following:
- Contributions to low mutagenicity (log revertants/nmol < -0.1):

• small number of ring atoms that can be substituted

• minimum number of nitro-substituents (or generally any double bond on
the substituent in the ring vicinity)

• no channel

• small number of tertiary carbon atoms among the aromatic rings

- Contributions to high mutagenicity (log revertants/nmol > 1.9):

• higher energy (> 1.4 eV) of the lowest unoccupied molecular orbital
(LUMO)

• presence of tertiary carbon atoms in the aromatic rings

• no five-member ring as a part of the channel

The condition of no five-member ring as a part of the channel for high
mutagenicity represents a new contribution to toxicological knowledge.

The results have demonstrated the applicability of fingerprint descriptors to
mutagen discovery and, more generally, the wide applicability of GUHA to SAR.
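
The statistical evaluation of a hypothesis mentioned above can be illustrated
with a fourfold table. The sketch below is not the PC GUHA implementation: it
counts the cases for an antecedent/succedent pair, computes the ratio
a/(a + b) as one reading of "cases fulfilling the hypothesis to their total",
and computes a one-sided Fisher exact p-value from the hypergeometric
distribution. The function names and the toy compound attributes are
hypothetical.

```python
from math import comb


def fourfold(cases, antecedent, succedent):
    """Count (a, b, c, d): antecedent true/false against succedent true/false."""
    a = b = c = d = 0
    for case in cases:
        ant, suc = antecedent(case), succedent(case)
        if ant and suc:
            a += 1
        elif ant:
            b += 1
        elif suc:
            c += 1
        else:
            d += 1
    return a, b, c, d


def confidence(a, b, c, d):
    """Share of cases satisfying the antecedent that also satisfy the succedent."""
    return a / (a + b)


def fisher_one_sided(a, b, c, d):
    """P(X >= a) under the hypergeometric null with fixed table margins."""
    n, row, col = a + b + c + d, a + b, a + c
    upper = min(row, col)
    tail = sum(comb(row, i) * comb(n - row, col - i) for i in range(a, upper + 1))
    return tail / comb(n, col)


# Toy compounds with made-up boolean attributes.
compounds = (
    [{"high_lumo": True, "high_mutagenicity": True}] * 3
    + [{"high_lumo": True, "high_mutagenicity": False}]
    + [{"high_lumo": False, "high_mutagenicity": True}]
    + [{"high_lumo": False, "high_mutagenicity": False}] * 3
)
table = fourfold(compounds,
                 lambda c: c["high_lumo"],
                 lambda c: c["high_mutagenicity"])
p_value = fisher_one_sided(*table)
```

A hypothesis would be retained when the ratio is high and the p-value is
small, with thresholds chosen by the analyst.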
1 Introduction

With the development of data mining approaches and techniques, applications of
data mining can be found in many organizations, including banking, insurance,
industry, and government. In this paper, we present the mining processes and
results of discovering trends and patterns from a set of health and living
habit questionnaire data. The task is to discover the primary factors of cancer
patients from the questionnaires. These factors (rules) are helpful for cancer
control and for decreasing cancer incidence. Mining such questionnaire data
requires careful investigation of the data contents, since such a data set is
noisy and contains a large number of missing values.

This work was performed with IBM Intelligent Miner for Data (IM4D), a data
mining tool. IM4D provides both the data cleaning and the mining functions we
need. In this application, we first clean the data, then generate clusters by
clustering and create decision trees based on the generated mining bases. The
questionnaire data set consists of 254 attributes and 47,657 records. These
attributes mainly include: personal information (birthday, address, etc.);
records of one's illnesses (apoplexy, high blood pressure, cancer, etc.); the
health status of one's family (parents, brother(s) and sister(s)); one's health
status in the most recent year (bowel movements, sleeping, etc.); drinking
(what kind of liquor, how often, how much, etc.); smoking (when begun, how many
cigarettes a day, etc.); eating habits (regular meal times, what kind of food,
what kind of meat and fish, etc.); occupation (teacher, doctor, company staff,
etc.); and so on. Almost every such questionnaire record contains missing
values for some attributes. For example, a female-oriented question is not
answered by males. Other missing values arise because respondents chose not to
answer.

2 Mining Process and Mining Results

The mining process includes building mining bases, clustering, and generating
decision trees. Not all the questionnaire respondents are cancer patients. In
fact, only 560 of the 47,657 records belong to persons who were or are cancer
patients. We build two mining bases. One is the cancer patient mining base
(called mb1), which contains all 560 cancer patient records. The other, called
mb2, contains all the records of mb1 and another 560 records randomly selected
from the whole data set under the condition that a selected record is not a
cancer patient record. Moreover, because most of the attributes are
categorical, the missing values are assigned the value unknown.

Clustering with mb1 creates clusters, each of which describes different
features of the cancer patients. Classifying the records into cancer patients
and non-cancer ones (IM4D applies a disk-based approach to decision tree
generation) is done with mb2, and the generated rules are used to describe
the primary factors of cancer patients. Moreover, two further decision trees are
created for the male cancer patients and the female ones. The results of the
above mining processes are very interesting. Clustering mb1 yields 9 clusters;
cluster 1 contains 83% of the records of mb1 and cluster 2 contains 10%.
Therefore, investigating these two clusters captures the main features of the
cancer patients. In cluster 1, 79% of the cancer patients are female and 21% are
male; 72% of the patients had undergone surgery on the abdomen; for 70% of the
patients, the father was a smoker when the patient was of primary or middle
school age; and so on. With classification, three decision trees are generated
with mb2. The first tree (with 85.7% accuracy) is generated with all the
records; the second one (with 87.6% accuracy) with the male patients in mb2; the
final one (with 86.7% accuracy) with the female patients in mb2. Each accuracy
is computed with a confusion matrix. For instance, the second decision tree
contains a rule stating that, among 122 patients, 42 persons had undergone
surgery on the abdomen, contracted gallstone/gallbladder inflammation, and
suffered from diabetes. The third decision tree contains a rule stating that,
among 372 patients, 100 had their menses stopped by surgery, were younger than
26.5 at their first marriage, and suffered from both constipation and diabetes.
Besides these, more useful rules have been learnt.
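
Computing an accuracy from a confusion matrix is a one-line calculation, shown
here for completeness. The matrix layout (rows are actual classes, columns are
predicted classes) and the counts are hypothetical and do not reproduce the
results reported above.

```python
def accuracy(confusion):
    """Fraction of correctly classified records: the diagonal sum of the
    confusion matrix divided by the sum of all entries."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Hypothetical counts: rows = actual (cancer, non-cancer), columns = predicted.
example_matrix = [[50, 10],
                  [5, 35]]
example_accuracy = accuracy(example_matrix)
```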

3 Conclusions 

We have successfully obtained mining results from the questionnaire data. This
application is significant because the results can be applied directly to
cancer prevention and cancer control. The next step is to collect more data
(questionnaire data from other districts) and do further mining to generate
more general rules.
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